revision 358 by ph10, Wed Jul 9 11:03:07 2008 UTC | revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC | |
---|---|---|
# | Line 2 | Line 2 |
2 | This file contains a concatenation of the PCRE man pages, converted to plain | This file contains a concatenation of the PCRE man pages, converted to plain |
3 | text format for ease of searching with a text editor, or for use on systems | text format for ease of searching with a text editor, or for use on systems |
4 | that do not have a man page processor. The small individual files that give | that do not have a man page processor. The small individual files that give |
5 | synopses of each function in the library have not been included. There are | synopses of each function in the library have not been included. Neither has |
6 | separate text files for the pcregrep and pcretest commands. | the pcredemo program. There are separate text files for the pcregrep and |
7 | pcretest commands. | |
8 | ----------------------------------------------------------------------------- | ----------------------------------------------------------------------------- |
9 | ||
10 | ||
# | Line 24 INTRODUCTION | Line 25 INTRODUCTION |
25 | tax items, and there is an option for requesting some minor changes | tax items, and there is an option for requesting some minor changes |
26 | that give better JavaScript compatibility. | that give better JavaScript compatibility. |
27 | ||
28 | The current implementation of PCRE (release 7.x) corresponds approxi- | The current implementation of PCRE (release 8.xx) corresponds approxi- |
29 | mately with Perl 5.10, including support for UTF-8 encoded strings and | mately with Perl 5.10, including support for UTF-8 encoded strings and |
30 | Unicode general category properties. However, UTF-8 and Unicode support | Unicode general category properties. However, UTF-8 and Unicode support |
31 | has to be explicitly enabled; it is not the default. The Unicode tables | has to be explicitly enabled; it is not the default. The Unicode tables |
32 | correspond to Unicode release 5.0.0. | correspond to Unicode release 5.1. |
33 | ||
34 | In addition to the Perl-compatible matching function, PCRE contains an | In addition to the Perl-compatible matching function, PCRE contains an |
35 | alternative matching function that matches the same compiled patterns | alternative matching function that matches the same compiled patterns |
# | Line 71 USER DOCUMENTATION | Line 72 USER DOCUMENTATION |
72 | The user documentation for PCRE comprises a number of different sec- | The user documentation for PCRE comprises a number of different sec- |
73 | tions. In the "man" format, each of these is a separate "man page". In | tions. In the "man" format, each of these is a separate "man page". In |
74 | the HTML format, each is a separate page, linked from the index page. | the HTML format, each is a separate page, linked from the index page. |
75 | In the plain text format, all the sections are concatenated, for ease | In the plain text format, all the sections, except the pcredemo sec- |
76 | of searching. The sections are as follows: | tion, are concatenated, for ease of searching. The sections are as fol- |
77 | lows: | |
78 | ||
79 | pcre this document | pcre this document |
80 | pcre-config show PCRE installation configuration information | pcre-config show PCRE installation configuration information |
# | Line 81 USER DOCUMENTATION | Line 83 USER DOCUMENTATION |
83 | pcrecallout details of the callout feature | pcrecallout details of the callout feature |
84 | pcrecompat discussion of Perl compatibility | pcrecompat discussion of Perl compatibility |
85 | pcrecpp details of the C++ wrapper | pcrecpp details of the C++ wrapper |
86 | pcredemo a demonstration C program that uses PCRE | |
87 | pcregrep description of the pcregrep command | pcregrep description of the pcregrep command |
88 | pcrematching discussion of the two matching algorithms | pcrematching discussion of the two matching algorithms |
89 | pcrepartial details of the partial matching facility | pcrepartial details of the partial matching facility |
# | Line 90 USER DOCUMENTATION | Line 93 USER DOCUMENTATION |
93 | pcreperform discussion of performance issues | pcreperform discussion of performance issues |
94 | pcreposix the POSIX-compatible C API | pcreposix the POSIX-compatible C API |
95 | pcreprecompile details of saving and re-using precompiled patterns | pcreprecompile details of saving and re-using precompiled patterns |
96 | pcresample discussion of the sample program | pcresample discussion of the pcredemo program |
97 | pcrestack discussion of stack usage | pcrestack discussion of stack usage |
98 | pcretest description of the pcretest testing command | pcretest description of the pcretest testing command |
99 | ||
# | Line 136 UTF-8 AND UNICODE PROPERTY SUPPORT | Line 139 UTF-8 AND UNICODE PROPERTY SUPPORT |
139 | ||
140 | In order to process UTF-8 strings, you must build PCRE to include UTF-8 | In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
141 | support in the code, and, in addition, you must call pcre_compile() | support in the code, and, in addition, you must call pcre_compile() |
142 | with the PCRE_UTF8 option flag. When you do this, both the pattern and | with the PCRE_UTF8 option flag, or the pattern must start with the |
143 | any subject strings that are matched against it are treated as UTF-8 | sequence (*UTF8). When either of these is the case, both the pattern |
144 | strings instead of just strings of bytes. | and any subject strings that are matched against it are treated as |
145 | UTF-8 strings instead of just strings of bytes. | |
146 | ||
147 | If you compile PCRE with UTF-8 support, but do not use it at run time, | If you compile PCRE with UTF-8 support, but do not use it at run time, |
148 | the library will be a bit bigger, but the additional run time overhead | the library will be a bit bigger, but the additional run time overhead |
149 | is limited to testing the PCRE_UTF8 flag occasionally, so should not be | is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
150 | very big. | very big. |
151 | ||
152 | If PCRE is built with Unicode character property support (which implies | If PCRE is built with Unicode character property support (which implies |
153 | UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- | UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
154 | ported. The available properties that can be tested are limited to the | ported. The available properties that can be tested are limited to the |
155 | general category properties such as Lu for an upper case letter or Nd | general category properties such as Lu for an upper case letter or Nd |
156 | for a decimal number, the Unicode script names such as Arabic or Han, | for a decimal number, the Unicode script names such as Arabic or Han, |
157 | and the derived properties Any and L&. A full list is given in the | and the derived properties Any and L&. A full list is given in the |
158 | pcrepattern documentation. Only the short names for properties are sup- | pcrepattern documentation. Only the short names for properties are sup- |
159 | ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- | ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
160 | ter}, is not supported. Furthermore, in Perl, many properties may | ter}, is not supported. Furthermore, in Perl, many properties may |
161 | optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE | optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
162 | does not support this. | does not support this. |
163 | ||
164 | Validity of UTF-8 strings | Validity of UTF-8 strings |
165 | ||
166 | When you set the PCRE_UTF8 flag, the strings passed as patterns and | When you set the PCRE_UTF8 flag, the strings passed as patterns and |
167 | subjects are (by default) checked for validity on entry to the relevant | subjects are (by default) checked for validity on entry to the relevant |
168 | functions. From release 7.3 of PCRE, the check is according to the rules | functions. From release 7.3 of PCRE, the check is according to the rules |
169 | of RFC 3629, which are themselves derived from the Unicode specifica- | of RFC 3629, which are themselves derived from the Unicode specifica- |
170 | tion. Earlier releases of PCRE followed the rules of RFC 2279, which | tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
171 | allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current | allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
172 | check allows only values in the range U+0 to U+10FFFF, excluding U+D800 | check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
173 | to U+DFFF. | to U+DFFF. |
174 | ||
175 | The excluded code points are the "Low Surrogate Area" of Unicode, of | The excluded code points are the "Low Surrogate Area" of Unicode, of |
176 | which the Unicode Standard says this: "The Low Surrogate Area does not | which the Unicode Standard says this: "The Low Surrogate Area does not |
177 | contain any character assignments, consequently no character code | contain any character assignments, consequently no character code |
178 | charts or namelists are provided for this area. Surrogates are reserved | charts or namelists are provided for this area. Surrogates are reserved |
179 | for use with UTF-16 and then must be used in pairs." The code points | for use with UTF-16 and then must be used in pairs." The code points |
180 | that are encoded by UTF-16 pairs are available as independent code | that are encoded by UTF-16 pairs are available as independent code |
181 | points in the UTF-8 encoding. (In other words, the whole surrogate | points in the UTF-8 encoding. (In other words, the whole surrogate |
182 | thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) | thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
183 | ||
184 | If an invalid UTF-8 string is passed to PCRE, an error return | If an invalid UTF-8 string is passed to PCRE, an error return |
185 | (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know | (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
186 | that your strings are valid, and therefore want to skip these checks in | that your strings are valid, and therefore want to skip these checks in |
187 | order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at | order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
188 | compile time or at run time, PCRE assumes that the pattern or subject | compile time or at run time, PCRE assumes that the pattern or subject |
189 | it is given (respectively) contains only valid UTF-8 codes. In this | it is given (respectively) contains only valid UTF-8 codes. In this |
190 | case, it does not diagnose an invalid UTF-8 string. | case, it does not diagnose an invalid UTF-8 string. |
191 | ||
192 | If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, | If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
193 | what happens depends on why the string is invalid. If the string con- | what happens depends on why the string is invalid. If the string con- |
194 | forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a | forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
195 | string of characters in the range 0 to 0x7FFFFFFF. In other words, | string of characters in the range 0 to 0x7FFFFFFF. In other words, |
196 | apart from the initial validity test, PCRE (when in UTF-8 mode) handles | apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
197 | strings according to the more liberal rules of RFC 2279. However, if | strings according to the more liberal rules of RFC 2279. However, if |
198 | the string does not even conform to RFC 2279, the result is undefined. | the string does not even conform to RFC 2279, the result is undefined. |
199 | Your program may crash. | Your program may crash. |
200 | ||
201 | If you want to process strings of values in the full range 0 to | If you want to process strings of values in the full range 0 to |
202 | 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can | 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
203 | set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in | set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
204 | this situation, you will have to apply your own validity check. | this situation, you will have to apply your own validity check. |
205 | ||
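The validity rule described above (values U+0 to U+10FFFF, excluding U+D800 to U+DFFF) can be sketched as a standalone check, for example for callers who validate a subject once and then set PCRE_NO_UTF8_CHECK on later calls. This is a minimal sketch, not PCRE's own code: the function name utf8_is_valid_rfc3629 is hypothetical, and the rejection of overlong forms is taken from RFC 3629 itself rather than from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if the len bytes at s are well-formed UTF-8 under RFC 3629
   (code points U+0000..U+10FFFF, no surrogates, no overlong forms),
   and 0 otherwise. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;      /* decoded code point */
        size_t extra, k;       /* number of continuation bytes expected */

        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return 0;  /* stray continuation byte, or an RFC 2279 5/6-byte lead */

        if (extra > len - i - 1) return 0;       /* sequence is truncated */
        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;   /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (extra == 1 && cp < 0x80)    return 0;   /* overlong 2-byte form */
        if (extra == 2 && cp < 0x800)   return 0;   /* overlong 3-byte form */
        if (extra == 3 && cp < 0x10000) return 0;   /* overlong 4-byte form */
        if (cp > 0x10FFFF)                 return 0; /* beyond U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF)  return 0; /* surrogate code point */

        i += extra + 1;
    }
    return 1;
}
```

A string that passes this check is also acceptable under the older RFC 2279 rules, so it stays within the defined behaviour described above when PCRE_NO_UTF8_CHECK is set.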
206 | General comments about UTF-8 mode | General comments about UTF-8 mode |
207 | ||
208 | 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a | 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
209 | two-byte UTF-8 character if the value is greater than 127. | two-byte UTF-8 character if the value is greater than 127. |
210 | ||
211 | 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 | 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
212 | characters for values greater than \177. | characters for values greater than \177. |
213 | ||
214 | 3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- | 3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
215 | vidual bytes, for example: \x{100}{3}. | vidual bytes, for example: \x{100}{3}. |
216 | ||
217 | 4. The dot metacharacter matches one UTF-8 character instead of a sin- | 4. The dot metacharacter matches one UTF-8 character instead of a sin- |
218 | gle byte. | gle byte. |
219 | ||
220 | 5. The escape sequence \C can be used to match a single byte in UTF-8 | 5. The escape sequence \C can be used to match a single byte in UTF-8 |
221 | mode, but its use can lead to some strange effects. This facility is | mode, but its use can lead to some strange effects. This facility is |
222 | not available in the alternative matching function, pcre_dfa_exec(). | not available in the alternative matching function, pcre_dfa_exec(). |
223 | ||
224 | 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly | 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
225 | test characters of any code value, but the characters that PCRE recog- | test characters of any code value, but the characters that PCRE recog- |
226 | nizes as digits, spaces, or word characters remain the same set as | nizes as digits, spaces, or word characters remain the same set as |
227 | before, all with values less than 256. This remains true even when PCRE | before, all with values less than 256. This remains true even when PCRE |
228 | includes Unicode property support, because to do otherwise would slow | includes Unicode property support, because to do otherwise would slow |
229 | down PCRE in many common cases. If you really want to test for a wider | down PCRE in many common cases. If you really want to test for a wider |
230 | sense of, say, "digit", you must use Unicode property tests such as | sense of, say, "digit", you must use Unicode property tests such as |
231 | \p{Nd}. | \p{Nd}. Note that this also applies to \b, because it is defined in |
232 | terms of \w and \W. | |
233 | ||
234 | 7. Similarly, characters that match the POSIX named character classes | 7. Similarly, characters that match the POSIX named character classes |
235 | are all low-valued characters. | are all low-valued characters. |
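Items 1 and 2 above follow from how UTF-8 encodes values between 128 and 2047: each takes exactly two bytes. The sketch below shows the encoding arithmetic; utf8_encode is a hypothetical helper written for illustration and is not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Encodes code point cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp is a surrogate
   or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this helper, the value 0xb3 from item 1 encodes as the two bytes C2 B3, and the largest octal escape from item 2, \777 (511), encodes as C7 BF.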
# | Line 258 AUTHOR | Line 263 AUTHOR |
263 | ||
264 | REVISION | REVISION |
265 | ||
266 | Last updated: 12 April 2008 | Last updated: 01 September 2009 |
267 | Copyright (c) 1997-2008 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
268 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
269 | ||
270 | ||
271 | PCREBUILD(3) PCREBUILD(3) | PCREBUILD(3) PCREBUILD(3) |
272 | ||
273 | ||
# | Line 277 PCRE BUILD-TIME OPTIONS | Line 282 PCRE BUILD-TIME OPTIONS |
282 | script, where the optional features are selected or deselected by pro- | script, where the optional features are selected or deselected by pro- |
283 | viding options to configure before running the make command. However, | viding options to configure before running the make command. However, |
284 | the same options can be selected in both Unix-like and non-Unix-like | the same options can be selected in both Unix-like and non-Unix-like |
285 | environments using the GUI facility of CMakeSetup if you are using | environments using the GUI facility of cmake-gui if you are using CMake |
286 | CMake instead of configure to build PCRE. | instead of configure to build PCRE. |
287 | ||
288 | There is a lot more information about building PCRE in non-Unix-like | |
289 | environments in the file called NON_UNIX_USE, which is part of the PCRE | |
290 | distribution. You should consult this file as well as the README file | |
291 | if you are building in a non-Unix-like environment. | |
292 | ||
293 | The complete list of options for configure (which includes the standard | The complete list of options for configure (which includes the standard |
294 | ones such as the selection of the installation directory) can be | ones such as the selection of the installation directory) can be |
295 | obtained by running | obtained by running |
296 | ||
297 | ./configure --help | ./configure --help |
298 | ||
299 | The following sections include descriptions of options whose names | The following sections include descriptions of options whose names |
300 | begin with --enable or --disable. These settings specify changes to the | begin with --enable or --disable. These settings specify changes to the |
301 | defaults for the configure command. Because of the way that configure | defaults for the configure command. Because of the way that configure |
302 | works, --enable and --disable always come in pairs, so the complemen- | works, --enable and --disable always come in pairs, so the complemen- |
303 | tary option always exists as well, but as it specifies the default, it | tary option always exists as well, but as it specifies the default, it |
304 | is not described. | is not described. |
305 | ||
306 | ||
# | Line 307 C++ SUPPORT | Line 317 C++ SUPPORT |
317 | ||
318 | UTF-8 SUPPORT | UTF-8 SUPPORT |
319 | ||
320 | To build PCRE with support for UTF-8 character strings, add | To build PCRE with support for UTF-8 Unicode character strings, add |
321 | ||
322 | --enable-utf8 | --enable-utf8 |
323 | ||
324 | to the configure command. Of itself, this does not make PCRE treat | to the configure command. Of itself, this does not make PCRE treat |
325 | strings as UTF-8. As well as compiling PCRE with this option, you also | strings as UTF-8. As well as compiling PCRE with this option, you also |
326 | have to set the PCRE_UTF8 option when you call the pcre_compile() | have to set the PCRE_UTF8 option when you call the pcre_compile() |
327 | function. | function. |
328 | ||
329 | If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE | |
330 | expects its input to be either ASCII or UTF-8 (depending on the runtime | |
331 | option). It is not possible to support both EBCDIC and UTF-8 codes in | |
332 | the same version of the library. Consequently, --enable-utf8 and | |
333 | --enable-ebcdic are mutually exclusive. | |
334 | ||
335 | ||
336 | UNICODE CHARACTER PROPERTY SUPPORT | UNICODE CHARACTER PROPERTY SUPPORT |
337 | ||
338 | UTF-8 support allows PCRE to process character values greater than 255 | UTF-8 support allows PCRE to process character values greater than 255 |
339 | in the strings that it handles. On its own, however, it does not pro- | in the strings that it handles. On its own, however, it does not pro- |
340 | vide any facilities for accessing the properties of such characters. If | vide any facilities for accessing the properties of such characters. If |
341 | you want to be able to use the pattern escapes \P, \p, and \X, which | you want to be able to use the pattern escapes \P, \p, and \X, which |
342 | refer to Unicode character properties, you must add | refer to Unicode character properties, you must add |
343 | ||
344 | --enable-unicode-properties | --enable-unicode-properties |
345 | ||
346 | to the configure command. This implies UTF-8 support, even if you have | to the configure command. This implies UTF-8 support, even if you have |
347 | not explicitly requested it. | not explicitly requested it. |
348 | ||
349 | Including Unicode property support adds around 30K of tables to the | Including Unicode property support adds around 30K of tables to the |
350 | PCRE library. Only the general category properties such as Lu and Nd | PCRE library. Only the general category properties such as Lu and Nd |
351 | are supported. Details are given in the pcrepattern documentation. | are supported. Details are given in the pcrepattern documentation. |
352 | ||
353 | ||
354 | CODE VALUE OF NEWLINE | CODE VALUE OF NEWLINE |
355 | ||
356 | By default, PCRE interprets character 10 (linefeed, LF) as indicating | By default, PCRE interprets the linefeed (LF) character as indicating |
357 | the end of a line. This is the normal newline character on Unix-like | the end of a line. This is the normal newline character on Unix-like |
358 | systems. You can compile PCRE to use character 13 (carriage return, CR) | systems. You can compile PCRE to use carriage return (CR) instead, by |
359 | instead, by adding | adding |
360 | ||
361 | --enable-newline-is-cr | --enable-newline-is-cr |
362 | ||
363 | to the configure command. There is also a --enable-newline-is-lf | to the configure command. There is also a --enable-newline-is-lf |
364 | option, which explicitly specifies linefeed as the newline character. | option, which explicitly specifies linefeed as the newline character. |
365 | ||
366 | Alternatively, you can specify that line endings are to be indicated by | Alternatively, you can specify that line endings are to be indicated by |
# | Line 356 CODE VALUE OF NEWLINE | Line 372 CODE VALUE OF NEWLINE |
372 | ||
373 | --enable-newline-is-anycrlf | --enable-newline-is-anycrlf |
374 | ||
375 | which causes PCRE to recognize any of the three sequences CR, LF, or | which causes PCRE to recognize any of the three sequences CR, LF, or |
376 | CRLF as indicating a line ending. Finally, a fifth option, specified by | CRLF as indicating a line ending. Finally, a fifth option, specified by |
377 | ||
378 | --enable-newline-is-any | --enable-newline-is-any |
# | Line 516 USING EBCDIC CODE | Line 532 USING EBCDIC CODE |
532 | ||
533 | to the configure command. This setting implies --enable-rebuild-charta- | to the configure command. This setting implies --enable-rebuild-charta- |
534 | bles. You should only use it if you know that you are in an EBCDIC | bles. You should only use it if you know that you are in an EBCDIC |
535 | environment (for example, an IBM mainframe operating system). | environment (for example, an IBM mainframe operating system). The |
536 | --enable-ebcdic option is incompatible with --enable-utf8. | |
537 | ||
538 | ||
539 | PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT | PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT |
# | Line 529 PCREGREP OPTIONS FOR COMPRESSED FILE SUP | Line 546 PCREGREP OPTIONS FOR COMPRESSED FILE SUP |
546 | --enable-pcregrep-libbz2 | --enable-pcregrep-libbz2 |
547 | ||
548 | to the configure command. These options naturally require that the rel- | to the configure command. These options naturally require that the rel- |
549 | evant libraries are installed on your system. Configuration will fail | evant libraries are installed on your system. Configuration will fail |
550 | if they are not. | if they are not. |
551 | ||
552 | ||
# | Line 539 PCRETEST OPTION FOR LIBREADLINE SUPPORT | Line 556 PCRETEST OPTION FOR LIBREADLINE SUPPORT |
556 | ||
557 | --enable-pcretest-libreadline | --enable-pcretest-libreadline |
558 | ||
559 | to the configure command, pcretest is linked with the libreadline | to the configure command, pcretest is linked with the libreadline |
560 | library, and when its input is from a terminal, it reads it using the | library, and when its input is from a terminal, it reads it using the |
561 | readline() function. This provides line-editing and history facilities. | readline() function. This provides line-editing and history facilities. |
562 | Note that libreadline is GPL-licenced, so if you distribute a binary of | Note that libreadline is GPL-licenced, so if you distribute a binary of |
563 | pcretest linked in this way, there may be licensing issues. | pcretest linked in this way, there may be licensing issues. |
564 | ||
565 | Setting this option causes the -lreadline option to be added to the | Setting this option causes the -lreadline option to be added to the |
566 | pcretest build. In many operating environments with a system-installed | pcretest build. In many operating environments with a system-installed |
567 | libreadline this is sufficient. However, in some environments (e.g. if | libreadline this is sufficient. However, in some environments (e.g. if |
568 | an unmodified distribution version of readline is in use), some extra | an unmodified distribution version of readline is in use), some extra |
569 | configuration may be necessary. The INSTALL file for libreadline says | configuration may be necessary. The INSTALL file for libreadline says |
570 | this: | this: |
571 | ||
572 | "Readline uses the termcap functions, but does not link with the | "Readline uses the termcap functions, but does not link with the |
573 | termcap or curses library itself, allowing applications which link | termcap or curses library itself, allowing applications which link |
574 | with readline the to choose an appropriate library." | with readline the to choose an appropriate library." |
575 | ||
576 | If your environment has not been set up so that an appropriate library | If your environment has not been set up so that an appropriate library |
577 | is automatically included, you may need to add something like | is automatically included, you may need to add something like |
578 | ||
579 | LIBS="-lncurses" | LIBS="-lncurses" |
# | Line 578 AUTHOR | Line 595 AUTHOR |
595 | ||
596 | REVISION | REVISION |
597 | ||
598 | Last updated: 13 April 2008 | Last updated: 06 September 2009 |
599 | Copyright (c) 1997-2008 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
600 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
601 | ||
602 | ||
603 | PCREMATCHING(3) PCREMATCHING(3) | PCREMATCHING(3) PCREMATCHING(3) |
604 | ||
605 | ||
# | Line 684 THE ALTERNATIVE MATCHING ALGORITHM | Line 701 THE ALTERNATIVE MATCHING ALGORITHM |
701 | at the fourth character of the subject. The algorithm does not automat- | at the fourth character of the subject. The algorithm does not automat- |
702 | ically move on to find matches that start at later positions. | ically move on to find matches that start at later positions. |
703 | ||
704 | Although the general principle of this matching algorithm is that it | |
705 | scans the subject string only once, without backtracking, there is one | |
706 | exception: when a lookbehind assertion is encountered, the preceding | |
707 | characters have to be re-inspected. | |
708 | ||
709 | There are a number of features of PCRE regular expressions that are not | There are a number of features of PCRE regular expressions that are not |
710 | supported by the alternative matching algorithm. They are as follows: | supported by the alternative matching algorithm. They are as follows: |
711 | ||
712 | 1. Because the algorithm finds all possible matches, the greedy or | 1. Because the algorithm finds all possible matches, the greedy or |
713 | ungreedy nature of repetition quantifiers is not relevant. Greedy and | ungreedy nature of repetition quantifiers is not relevant. Greedy and |
714 | ungreedy quantifiers are treated in exactly the same way. However, pos- | ungreedy quantifiers are treated in exactly the same way. However, pos- |
715 | sessive quantifiers can make a difference when what follows could also | sessive quantifiers can make a difference when what follows could also |
716 | match what is quantified, for example in a pattern like this: | match what is quantified, for example in a pattern like this: |
717 | ||
718 | ^a++\w! | ^a++\w! |
719 | ||
720 | This pattern matches "aaab!" but not "aaa!", which would be matched by | This pattern matches "aaab!" but not "aaa!", which would be matched by |
721 | a non-possessive quantifier. Similarly, if an atomic group is present, | a non-possessive quantifier. Similarly, if an atomic group is present, |
722 | it is matched as if it were a standalone pattern at the current point, | it is matched as if it were a standalone pattern at the current point, |
723 | and the longest match is then "locked in" for the rest of the overall | and the longest match is then "locked in" for the rest of the overall |
724 | pattern. | pattern. |
725 | ||
726 | 2. When dealing with multiple paths through the tree simultaneously, it | 2. When dealing with multiple paths through the tree simultaneously, it |
727 | is not straightforward to keep track of captured substrings for the | is not straightforward to keep track of captured substrings for the |
728 | different matching possibilities, and PCRE's implementation of this | different matching possibilities, and PCRE's implementation of this |
729 | algorithm does not attempt to do this. This means that no captured sub- | algorithm does not attempt to do this. This means that no captured sub- |
730 | strings are available. | strings are available. |
731 | ||
732 | 3. Because no substrings are captured, back references within the pat- | 3. Because no substrings are captured, back references within the pat- |
733 | tern are not supported, and cause errors if encountered. | tern are not supported, and cause errors if encountered. |
734 | ||
735 | 4. For the same reason, conditional expressions that use a backrefer- | 4. For the same reason, conditional expressions that use a backrefer- |
736 | ence as the condition or test for a specific group recursion are not | ence as the condition or test for a specific group recursion are not |
737 | supported. | supported. |
738 | ||
739 | 5. Because many paths through the tree may be active, the \K escape | 5. Because many paths through the tree may be active, the \K escape |
740 | sequence, which resets the start of the match when encountered (but may | sequence, which resets the start of the match when encountered (but may |
741 | be on some paths and not on others), is not supported. It causes an | be on some paths and not on others), is not supported. It causes an |
742 | error if encountered. | error if encountered. |
743 | ||
744 | 6. Callouts are supported, but the value of the capture_top field is | 6. Callouts are supported, but the value of the capture_top field is |
745 | always 1, and the value of the capture_last field is always -1. | always 1, and the value of the capture_last field is always -1. |
746 | ||
747 | 7. The \C escape sequence, which (in the standard algorithm) matches a | 7. The \C escape sequence, which (in the standard algorithm) matches a |
748 | single byte, even in UTF-8 mode, is not supported because the alterna- | single byte, even in UTF-8 mode, is not supported because the alterna- |
749 | tive algorithm moves through the subject string one character at a | tive algorithm moves through the subject string one character at a |
750 | time, for all active paths through the tree. | time, for all active paths through the tree. |
751 | ||
752 | 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) | 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
753 | are not supported. (*FAIL) is supported, and behaves like a failing | are not supported. (*FAIL) is supported, and behaves like a failing |
754 | negative assertion. | negative assertion. |
755 | ||
756 | ||
757 | ADVANTAGES OF THE ALTERNATIVE ALGORITHM | ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
758 | ||
759 | Using the alternative matching algorithm provides the following advan- | Using the alternative matching algorithm provides the following advan- |
760 | tages: | tages: |
761 | ||
762 | 1. All possible matches (at a single point in the subject) are automat- | 1. All possible matches (at a single point in the subject) are automat- |
763 | ically found, and in particular, the longest match is found. To find | ically found, and in particular, the longest match is found. To find |
764 | more than one match using the standard algorithm, you have to do kludgy | more than one match using the standard algorithm, you have to do kludgy |
765 | things with callouts. | things with callouts. |
766 | ||
767 | 2. There is much better support for partial matching. The restrictions | 2. Because the alternative algorithm scans the subject string just |
768 | on the content of the pattern that apply when using the standard algo- | once, and never needs to backtrack, it is possible to pass very long |
769 | rithm for partial matching do not apply to the alternative algorithm. | subject strings to the matching function in several pieces, checking |
For non-anchored patterns, the starting position of a partial match is | ||
available. | ||
3. Because the alternative algorithm scans the subject string just | ||
once, and never needs to backtrack, it is possible to pass very long | ||
subject strings to the matching function in several pieces, checking | ||
770 | for partial matching each time. | for partial matching each time. |
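The reason a single forward scan allows piecewise matching can be sketched with a hand-written state machine. The code below is not PCRE code: the states, the function name feed_piece, and the anchored pattern ab+c are all assumptions chosen for illustration. It shows only that, when no backtracking is needed, the entire "match so far" is a small state value that can be carried from one piece of the subject to the next.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written DFA for the anchored pattern ab+c; illustrative only. */
enum { ST_START, ST_A, ST_B, ST_MATCH, ST_FAIL };

/* Advance from `state` over one piece of the subject and return the new
   state. Each byte is examined once; nothing before the current piece
   is revisited, so the caller may discard earlier pieces. */
int feed_piece(int state, const char *piece, size_t length)
{
    size_t i;
    assert(piece != NULL);
    for (i = 0; i < length; i++) {
        char c = piece[i];
        switch (state) {
        case ST_START: state = (c == 'a') ? ST_A : ST_FAIL; break;
        case ST_A:     state = (c == 'b') ? ST_B : ST_FAIL; break;
        case ST_B:     state = (c == 'b') ? ST_B :
                               (c == 'c') ? ST_MATCH : ST_FAIL; break;
        default:       return state;  /* ST_MATCH or ST_FAIL: stop */
        }
    }
    return state;
}
```

A return value of ST_A or ST_B at the end of a piece plays the role of a partial match: the caller passes that state back in with the next piece. In PCRE itself the equivalent information is kept in the workspace vector of pcre_dfa_exec(), and continuation is requested with the PCRE_PARTIAL and PCRE_DFA_RESTART options.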
771 | ||
772 | ||
# | Line 758 DISADVANTAGES OF THE ALTERNATIVE ALGORIT | Line 774 DISADVANTAGES OF THE ALTERNATIVE ALGORIT |
774 | ||
775 | The alternative algorithm suffers from a number of disadvantages: | The alternative algorithm suffers from a number of disadvantages: |
776 | ||
777 | 1. It is substantially slower than the standard algorithm. This is | 1. It is substantially slower than the standard algorithm. This is |
778 | partly because it has to search for all possible matches, but is also | partly because it has to search for all possible matches, but is also |
779 | because it is less susceptible to optimization. | because it is less susceptible to optimization. |
780 | ||
781 | 2. Capturing parentheses and back references are not supported. | 2. Capturing parentheses and back references are not supported. |
# | Line 777 AUTHOR | Line 793 AUTHOR |
793 | ||
794 | REVISION | REVISION |
795 | ||
796 | Last updated: 19 April 2008 | Last updated: 05 September 2009 |
797 | Copyright (c) 1997-2008 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
798 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
799 | ||
800 | ||
801 | PCREAPI(3) PCREAPI(3) | PCREAPI(3) PCREAPI(3) |
802 | ||
803 | ||
# | Line 889 PCRE API OVERVIEW | Line 905 PCRE API OVERVIEW |
905 | pcre_exec() are used for compiling and matching regular expressions in | pcre_exec() are used for compiling and matching regular expressions in |
906 | a Perl-compatible manner. A sample program that demonstrates the sim- | a Perl-compatible manner. A sample program that demonstrates the sim- |
907 | plest way of using them is provided in the file called pcredemo.c in | plest way of using them is provided in the file called pcredemo.c in |
908 | the source distribution. The pcresample documentation describes how to | the PCRE source distribution. A listing of this program is given in the |
909 | compile and run it. | pcredemo documentation, and the pcresample documentation describes how |
910 | to compile and run it. | |
911 | ||
912 | A second matching function, pcre_dfa_exec(), which is not Perl-compati- | A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
913 | ble, is also provided. This uses a different algorithm for the match- | ble, is also provided. This uses a different algorithm for the match- |
914 | ing. The alternative algorithm finds all possible matches (at a given | ing. The alternative algorithm finds all possible matches (at a given |
915 | point in the subject), and scans the subject just once. However, this | point in the subject), and scans the subject just once (unless there |
916 | algorithm does not return captured substrings. A description of the two | are lookbehind assertions). However, this algorithm does not return |
917 | matching algorithms and their advantages and disadvantages is given in | captured substrings. A description of the two matching algorithms and |
918 | the pcrematching documentation. | their advantages and disadvantages is given in the pcrematching docu- |
919 | mentation. | |
920 | ||
921 | In addition to the main compiling and matching functions, there are | In addition to the main compiling and matching functions, there are |
922 | convenience functions for extracting captured substrings from a subject | convenience functions for extracting captured substrings from a subject |
# | Line 999 MULTITHREADING | Line 1017 MULTITHREADING |
1017 | pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the | pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
1018 | callout function pointed to by pcre_callout, are shared by all threads. | callout function pointed to by pcre_callout, are shared by all threads. |
1019 | ||
1020 | The compiled form of a regular expression is not altered during match- | The compiled form of a regular expression is not altered during match- |
1021 | ing, so the same compiled pattern can safely be used by several threads | ing, so the same compiled pattern can safely be used by several threads |
1022 | at once. | at once. |
1023 | ||
# | Line 1007 MULTITHREADING | Line 1025 MULTITHREADING |
1025 | SAVING PRECOMPILED PATTERNS FOR LATER USE | SAVING PRECOMPILED PATTERNS FOR LATER USE |
1026 | ||
1027 | The compiled form of a regular expression can be saved and re-used at a | The compiled form of a regular expression can be saved and re-used at a |
1028 | later time, possibly by a different program, and even on a host other | later time, possibly by a different program, and even on a host other |
1029 | than the one on which it was compiled. Details are given in the | than the one on which it was compiled. Details are given in the |
1030 | pcreprecompile documentation. However, compiling a regular expression | pcreprecompile documentation. However, compiling a regular expression |
1031 | with one version of PCRE for use with a different version is not guar- | with one version of PCRE for use with a different version is not guar- |
1032 | anteed to work and may cause crashes. | anteed to work and may cause crashes. |
1033 | ||
1034 | ||
# | Line 1018 CHECKING BUILD-TIME OPTIONS | Line 1036 CHECKING BUILD-TIME OPTIONS |
1036 | ||
1037 | int pcre_config(int what, void *where); | int pcre_config(int what, void *where); |
1038 | ||
1039 | The function pcre_config() makes it possible for a PCRE client to dis- | The function pcre_config() makes it possible for a PCRE client to dis- |
1040 | cover which optional features have been compiled into the PCRE library. | cover which optional features have been compiled into the PCRE library. |
1041 | The pcrebuild documentation has more details about these optional fea- | The pcrebuild documentation has more details about these optional fea- |
1042 | tures. | tures. |
1043 | ||
1044 | The first argument for pcre_config() is an integer, specifying which | The first argument for pcre_config() is an integer, specifying which |
1045 | information is required; the second argument is a pointer to a variable | information is required; the second argument is a pointer to a variable |
1046 | into which the information is placed. The following information is | into which the information is placed. The following information is |
1047 | available: | available: |
1048 | ||
1049 | PCRE_CONFIG_UTF8 | PCRE_CONFIG_UTF8 |
1050 | ||
1051 | The output is an integer that is set to one if UTF-8 support is avail- | The output is an integer that is set to one if UTF-8 support is avail- |
1052 | able; otherwise it is set to zero. | able; otherwise it is set to zero. |
1053 | ||
1054 | PCRE_CONFIG_UNICODE_PROPERTIES | PCRE_CONFIG_UNICODE_PROPERTIES |
1055 | ||
1056 | The output is an integer that is set to one if support for Unicode | The output is an integer that is set to one if support for Unicode |
1057 | character properties is available; otherwise it is set to zero. | character properties is available; otherwise it is set to zero. |
1058 | ||
1059 | PCRE_CONFIG_NEWLINE | PCRE_CONFIG_NEWLINE |
1060 | ||
1061 | The output is an integer whose value specifies the default character | The output is an integer whose value specifies the default character |
1062 | sequence that is recognized as meaning "newline". The four values that | sequence that is recognized as meaning "newline". The four values that |
1063 | are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, | are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
1064 | and -1 for ANY. The default should normally be the standard sequence | and -1 for ANY. Though they are derived from ASCII, the same values |
1065 | for your operating system. | are returned in EBCDIC environments. The default should normally corre- |
1066 | spond to the standard sequence for your operating system. | |
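The value 3338 for CRLF is the two ASCII codes packed into one integer, with CR (13) in the high byte and LF (10) in the low byte: 13 * 256 + 10 = 3338. A minimal sketch of unpacking such a value follows. The helper name newline_bytes is hypothetical and not part of PCRE; in a real program the value would come from pcre_config(PCRE_CONFIG_NEWLINE, &value).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): split a PCRE_CONFIG_NEWLINE
   value into the bytes of a fixed newline sequence. Returns the number
   of bytes written to out (1 or 2), or 0 for the ANYCRLF (-2) and
   ANY (-1) settings, which do not name a single fixed sequence. */
int newline_bytes(int value, unsigned char out[2])
{
    assert(out != NULL);
    if (value < 0)
        return 0;                              /* -1 ANY, -2 ANYCRLF */
    if (value > 255) {                         /* two-byte sequence */
        out[0] = (unsigned char)((value >> 8) & 255);  /* 13 = CR */
        out[1] = (unsigned char)(value & 255);         /* 10 = LF */
        return 2;
    }
    out[0] = (unsigned char)value;             /* 10 = LF or 13 = CR */
    return 1;
}
```

Because the same numeric values are returned in EBCDIC environments, the bytes produced this way identify the setting and are not necessarily the newline bytes of the local character set.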
1067 | ||
1068 | PCRE_CONFIG_BSR | PCRE_CONFIG_BSR |
1069 | ||
# | Line 1071 CHECKING BUILD-TIME OPTIONS | Line 1090 CHECKING BUILD-TIME OPTIONS |
1090 | ||
1091 | PCRE_CONFIG_MATCH_LIMIT | PCRE_CONFIG_MATCH_LIMIT |
1092 | ||
1093 | The output is an integer that gives the default limit for the number of | The output is a long integer that gives the default limit for the num- |
1094 | internal matching function calls in a pcre_exec() execution. Further | ber of internal matching function calls in a pcre_exec() execution. |
1095 | details are given with pcre_exec() below. | Further details are given with pcre_exec() below. |
1096 | ||
1097 | PCRE_CONFIG_MATCH_LIMIT_RECURSION | PCRE_CONFIG_MATCH_LIMIT_RECURSION |
1098 | ||
1099 | The output is an integer that gives the default limit for the depth of | The output is a long integer that gives the default limit for the depth |
1100 | recursion when calling the internal matching function in a pcre_exec() | of recursion when calling the internal matching function in a |
1101 | execution. Further details are given with pcre_exec() below. | pcre_exec() execution. Further details are given with pcre_exec() |
1102 | below. | |
1103 | ||
1104 | PCRE_CONFIG_STACKRECURSE | PCRE_CONFIG_STACKRECURSE |
1105 | ||
1106 | The output is an integer that is set to one if internal recursion when | The output is an integer that is set to one if internal recursion when |
1107 | running pcre_exec() is implemented by recursive function calls that use | running pcre_exec() is implemented by recursive function calls that use |
1108 | the stack to remember their state. This is the usual way that PCRE is | the stack to remember their state. This is the usual way that PCRE is |
1109 | compiled. The output is zero if PCRE was compiled to use blocks of data | compiled. The output is zero if PCRE was compiled to use blocks of data |
1110 | on the heap instead of recursive function calls. In this case, | on the heap instead of recursive function calls. In this case, |
1111 | pcre_stack_malloc and pcre_stack_free are called to manage memory | pcre_stack_malloc and pcre_stack_free are called to manage memory |
1112 | blocks on the heap, thus avoiding the use of the stack. | blocks on the heap, thus avoiding the use of the stack. |
1113 | ||
1114 | ||
# | Line 1105 COMPILING A PATTERN | Line 1125 COMPILING A PATTERN |
1125 | ||
1126 | Either of the functions pcre_compile() or pcre_compile2() can be called | Either of the functions pcre_compile() or pcre_compile2() can be called |
1127 | to compile a pattern into an internal form. The only difference between | to compile a pattern into an internal form. The only difference between |
1128 | the two interfaces is that pcre_compile2() has an additional argument, | the two interfaces is that pcre_compile2() has an additional argument, |
1129 | errorcodeptr, via which a numerical error code can be returned. | errorcodeptr, via which a numerical error code can be returned. |
1130 | ||
1131 | The pattern is a C string terminated by a binary zero, and is passed in | The pattern is a C string terminated by a binary zero, and is passed in |
1132 | the pattern argument. A pointer to a single block of memory that is | the pattern argument. A pointer to a single block of memory that is |
1133 | obtained via pcre_malloc is returned. This contains the compiled code | obtained via pcre_malloc is returned. This contains the compiled code |
1134 | and related data. The pcre type is defined for the returned block; this | and related data. The pcre type is defined for the returned block; this |
1135 | is a typedef for a structure whose contents are not externally defined. | is a typedef for a structure whose contents are not externally defined. |
1136 | It is up to the caller to free the memory (via pcre_free) when it is no | It is up to the caller to free the memory (via pcre_free) when it is no |
1137 | longer required. | longer required. |
1138 | ||
1139 | Although the compiled code of a PCRE regex is relocatable, that is, it | Although the compiled code of a PCRE regex is relocatable, that is, it |
1140 | does not depend on memory location, the complete pcre data block is not | does not depend on memory location, the complete pcre data block is not |
1141 | fully relocatable, because it may contain a copy of the tableptr argu- | fully relocatable, because it may contain a copy of the tableptr argu- |
1142 | ment, which is an address (see below). | ment, which is an address (see below). |
1143 | ||
1144 | The options argument contains various bit settings that affect the com- | The options argument contains various bit settings that affect the com- |
1145 | pilation. It should be zero if no options are required. The available | pilation. It should be zero if no options are required. The available |
1146 | options are described below. Some of them, in particular, those that | options are described below. Some of them (in particular, those that |
1147 | are compatible with Perl, can also be set and unset from within the | are compatible with Perl, but also some others) can also be set and |
1148 | pattern (see the detailed description in the pcrepattern documenta- | unset from within the pattern (see the detailed description in the |
1149 | tion). For these options, the contents of the options argument speci- | pcrepattern documentation). For those options that can be different in |
1150 | fies their initial settings at the start of compilation and execution. | different parts of the pattern, the contents of the options argument |
1151 | The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time | specifies their initial settings at the start of compilation and execu- |
1152 | of matching as well as at compile time. | tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the |
1153 | time of matching as well as at compile time. | |
1154 | ||
1155 | If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, | If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
1156 | if compilation of a pattern fails, pcre_compile() returns NULL, and | if compilation of a pattern fails, pcre_compile() returns NULL, and |
# | Line 1335 COMPILING A PATTERN | Line 1356 COMPILING A PATTERN |
1356 | and are therefore ignored. | and are therefore ignored. |
1357 | ||
1358 | The newline option that is set at compile time becomes the default that | The newline option that is set at compile time becomes the default that |
1359 | is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. | is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
1360 | ||
1361 | PCRE_NO_AUTO_CAPTURE | PCRE_NO_AUTO_CAPTURE |
1362 | ||
1363 | If this option is set, it disables the use of numbered capturing paren- | If this option is set, it disables the use of numbered capturing paren- |
1364 | theses in the pattern. Any opening parenthesis that is not followed by | theses in the pattern. Any opening parenthesis that is not followed by |
1365 | ? behaves as if it were followed by ?: but named parentheses can still | ? behaves as if it were followed by ?: but named parentheses can still |
1366 | be used for capturing (and they acquire numbers in the usual way). | be used for capturing (and they acquire numbers in the usual way). |
1367 | There is no equivalent of this option in Perl. | There is no equivalent of this option in Perl. |
1368 | ||
1369 | PCRE_UNGREEDY | PCRE_UNGREEDY |
1370 | ||
1371 | This option inverts the "greediness" of the quantifiers so that they | This option inverts the "greediness" of the quantifiers so that they |
1372 | are not greedy by default, but become greedy if followed by "?". It is | are not greedy by default, but become greedy if followed by "?". It is |
1373 | not compatible with Perl. It can also be set by a (?U) option setting | not compatible with Perl. It can also be set by a (?U) option setting |
1374 | within the pattern. | within the pattern. |
1375 | ||
1376 | PCRE_UTF8 | PCRE_UTF8 |
1377 | ||
1378 | This option causes PCRE to regard both the pattern and the subject as | This option causes PCRE to regard both the pattern and the subject as |
1379 | strings of UTF-8 characters instead of single-byte character strings. | strings of UTF-8 characters instead of single-byte character strings. |
1380 | However, it is available only when PCRE is built to include UTF-8 sup- | However, it is available only when PCRE is built to include UTF-8 sup- |
1381 | port. If not, the use of this option provokes an error. Details of how | port. If not, the use of this option provokes an error. Details of how |
1382 | this option changes the behaviour of PCRE are given in the section on | this option changes the behaviour of PCRE are given in the section on |
1383 | UTF-8 support in the main pcre page. | UTF-8 support in the main pcre page. |
1384 | ||
1385 | PCRE_NO_UTF8_CHECK | PCRE_NO_UTF8_CHECK |
1386 | ||
1387 | When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is | When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
1388 | automatically checked. There is a discussion about the validity of | automatically checked. There is a discussion about the validity of |
1389 | UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of | UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
1390 | bytes is found, pcre_compile() returns an error. If you already know | bytes is found, pcre_compile() returns an error. If you already know |
1391 | that your pattern is valid, and you want to skip this check for perfor- | that your pattern is valid, and you want to skip this check for perfor- |
1392 | mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is | mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
1393 | set, the effect of passing an invalid UTF-8 string as a pattern is | set, the effect of passing an invalid UTF-8 string as a pattern is |
1394 | undefined. It may cause your program to crash. Note that this option | undefined. It may cause your program to crash. Note that this option |
1395 | can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the | can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
1396 | UTF-8 validity checking of subject strings. | UTF-8 validity checking of subject strings. |
1397 | ||
1398 | ||
1399 | COMPILATION ERROR CODES | COMPILATION ERROR CODES |
1400 | ||
1401 | The following table lists the error codes that may be returned by | The following table lists the error codes that may be returned by |
1402 | pcre_compile2(), along with the error messages that may be returned by | pcre_compile2(), along with the error messages that may be returned by |
1403 | both compiling functions. As PCRE has developed, some error codes have | both compiling functions. As PCRE has developed, some error codes have |
1404 | fallen out of use. To avoid confusion, they have not been re-used. | fallen out of use. To avoid confusion, they have not been re-used. |
1405 | ||
1406 | 0 no error | 0 no error |
# | Line 1435 COMPILATION ERROR CODES | Line 1456 COMPILATION ERROR CODES |
1456 | 50 [this code is not in use] | 50 [this code is not in use] |
1457 | 51 octal value is greater than \377 (not in UTF-8 mode) | 51 octal value is greater than \377 (not in UTF-8 mode) |
1458 | 52 internal error: overran compiling workspace | 52 internal error: overran compiling workspace |
1459 | 53 internal error: previously-checked referenced subpattern not | 53 internal error: previously-checked referenced subpattern not |
1460 | found | found |
1461 | 54 DEFINE group contains more than one branch | 54 DEFINE group contains more than one branch |
1462 | 55 repeating a DEFINE group is not allowed | 55 repeating a DEFINE group is not allowed |
# | Line 1450 COMPILATION ERROR CODES | Line 1471 COMPILATION ERROR CODES |
1471 | 63 digit expected after (?+ | 63 digit expected after (?+ |
1472 | 64 ] is an invalid data character in JavaScript compatibility mode | 64 ] is an invalid data character in JavaScript compatibility mode |
1473 | ||
1474 | The numbers 32 and 10000 in errors 48 and 49 are defaults; different | The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
1475 | values may be used if the limits were changed when PCRE was built. | values may be used if the limits were changed when PCRE was built. |
1476 | ||
1477 | ||
# | Line 1459 STUDYING A PATTERN | Line 1480 STUDYING A PATTERN |
1480 | pcre_extra *pcre_study(const pcre *code, int options, | pcre_extra *pcre_study(const pcre *code, int options, |
1481 | const char **errptr); | const char **errptr); |
1482 | ||
1483 | If a compiled pattern is going to be used several times, it is worth | If a compiled pattern is going to be used several times, it is worth |
1484 | spending more time analyzing it in order to speed up the time taken for | spending more time analyzing it in order to speed up the time taken for |
1485 | matching. The function pcre_study() takes a pointer to a compiled pat- | matching. The function pcre_study() takes a pointer to a compiled pat- |
1486 | tern as its first argument. If studying the pattern produces additional | tern as its first argument. If studying the pattern produces additional |
1487 | information that will help speed up matching, pcre_study() returns a | information that will help speed up matching, pcre_study() returns a |
1488 | pointer to a pcre_extra block, in which the study_data field points to | pointer to a pcre_extra block, in which the study_data field points to |
1489 | the results of the study. | the results of the study. |
1490 | ||
1491 | The returned value from pcre_study() can be passed directly to | The returned value from pcre_study() can be passed directly to |
1492 | pcre_exec(). However, a pcre_extra block also contains other fields | pcre_exec(). However, a pcre_extra block also contains other fields |
1493 | that can be set by the caller before the block is passed; these are | that can be set by the caller before the block is passed; these are |
1494 | described below in the section on matching a pattern. | described below in the section on matching a pattern. |
1495 | ||
1496 | If studying the pattern does not produce any additional information | If studying the pattern does not produce any additional information |
1497 | pcre_study() returns NULL. In that circumstance, if the calling program | pcre_study() returns NULL. In that circumstance, if the calling program |
1498 | wants to pass any of the other fields to pcre_exec(), it must set up | wants to pass any of the other fields to pcre_exec(), it must set up |
1499 | its own pcre_extra block. | its own pcre_extra block. |
1500 | ||
1501 | The second argument of pcre_study() contains option bits. At present, | The second argument of pcre_study() contains option bits. At present, |
1502 | no options are defined, and this argument should always be zero. | no options are defined, and this argument should always be zero. |
1503 | ||
1504 | The third argument for pcre_study() is a pointer for an error message. | The third argument for pcre_study() is a pointer for an error message. |
1505 | If studying succeeds (even if no data is returned), the variable it | If studying succeeds (even if no data is returned), the variable it |
1506 | points to is set to NULL. Otherwise it is set to point to a textual | points to is set to NULL. Otherwise it is set to point to a textual |
1507 | error message. This is a static string that is part of the library. You | error message. This is a static string that is part of the library. You |
1508 | must not try to free it. You should test the error pointer for NULL | must not try to free it. You should test the error pointer for NULL |
1509 | after calling pcre_study(), to be sure that it has run successfully. | after calling pcre_study(), to be sure that it has run successfully. |
1510 | ||
1511 | This is a typical call to pcre_study(): | This is a typical call to pcre_study(): |
# | Line 1496 STUDYING A PATTERN | Line 1517 STUDYING A PATTERN |
1517 | &error); /* set to NULL or points to a message */ | &error); /* set to NULL or points to a message */ |
1518 | ||
1519 | At present, studying a pattern is useful only for non-anchored patterns | At present, studying a pattern is useful only for non-anchored patterns |
1520 | that do not have a single fixed starting character. A bitmap of possi- | that do not have a single fixed starting character. A bitmap of possi- |
1521 | ble starting bytes is created. | ble starting bytes is created. |
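How a bitmap of possible starting bytes can speed up a non-anchored search may be pictured as follows. This is a sketch under assumptions, not PCRE's internal code: the type start_bitmap and both function names are hypothetical. The bitmap holds one bit per byte value, and the scan skips every subject position whose byte cannot begin a match.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only: 256 bits, one per possible byte value. */
typedef struct {
    unsigned char bits[32];
} start_bitmap;

/* Mark every byte in the string `starters` as a possible first byte. */
void bitmap_init(start_bitmap *bm, const char *starters)
{
    assert(bm != NULL && starters != NULL);
    memset(bm->bits, 0, sizeof bm->bits);
    for (; *starters != '\0'; starters++) {
        unsigned char c = (unsigned char)*starters;
        bm->bits[c / 8] |= (unsigned char)(1u << (c % 8));
    }
}

/* Return the first offset >= start whose byte is in the bitmap, or
   length if there is none. A matcher would attempt a full match only
   at the returned offset. */
size_t bitmap_next_start(const start_bitmap *bm, const char *subject,
                         size_t length, size_t start)
{
    size_t i;
    assert(bm != NULL && subject != NULL);
    for (i = start; i < length; i++) {
        unsigned char c = (unsigned char)subject[i];
        if (bm->bits[c / 8] & (1u << (c % 8)))
            return i;
    }
    return length;
}
```

For a pattern such as (cat|dog), the possible starting bytes are "c" and "d", so a scan of "a big dog" would go straight to offset 6. A pattern with a single fixed starting character gains nothing from such a bitmap, which is consistent with the restriction described above.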
1522 | ||
1523 | ||
1524 | LOCALE SUPPORT | LOCALE SUPPORT |
1525 | ||
1526 | PCRE handles caseless matching, and determines whether characters are | PCRE handles caseless matching, and determines whether characters are |
1527 | letters, digits, or whatever, by reference to a set of tables, indexed | letters, digits, or whatever, by reference to a set of tables, indexed |
1528 | by character value. When running in UTF-8 mode, this applies only to | by character value. When running in UTF-8 mode, this applies only to |
1529 | characters with codes less than 128. Higher-valued codes never match | characters with codes less than 128. Higher-valued codes never match |
1530 | escapes such as \w or \d, but can be tested with \p if PCRE is built | escapes such as \w or \d, but can be tested with \p if PCRE is built |
1531 | with Unicode character property support. The use of locales with Uni- | with Unicode character property support. The use of locales with Uni- |
1532 | code is discouraged. If you are handling characters with codes greater | code is discouraged. If you are handling characters with codes greater |
1533 | than 127, you should either use UTF-8 and Unicode, or use locales, but | than 127, you should either use UTF-8 and Unicode, or use locales, but |
1534 | not try to mix the two. | not try to mix the two. |
1535 | ||
1536 | PCRE contains an internal set of tables that are used when the final | PCRE contains an internal set of tables that are used when the final |
1537 | argument of pcre_compile() is NULL. These are sufficient for many | argument of pcre_compile() is NULL. These are sufficient for many |
1538 | applications. Normally, the internal tables recognize only ASCII char- | applications. Normally, the internal tables recognize only ASCII char- |
1539 | acters. However, when PCRE is built, it is possible to cause the inter- | acters. However, when PCRE is built, it is possible to cause the inter- |
1540 | nal tables to be rebuilt in the default "C" locale of the local system, | nal tables to be rebuilt in the default "C" locale of the local system, |
1541 | which may cause them to be different. | which may cause them to be different. |
1542 | ||
1543 | The internal tables can always be overridden by tables supplied by the | The internal tables can always be overridden by tables supplied by the |
1544 | application that calls PCRE. These may be created in a different locale | application that calls PCRE. These may be created in a different locale |
1545 | from the default. As more and more applications change to using Uni- | from the default. As more and more applications change to using Uni- |
1546 | code, the need for this locale support is expected to die away. | code, the need for this locale support is expected to die away. |
1547 | ||
1548 | External tables are built by calling the pcre_maketables() function, | External tables are built by calling the pcre_maketables() function, |
1549 | which has no arguments, in the relevant locale. The result can then be | which has no arguments, in the relevant locale. The result can then be |
1550 | passed to pcre_compile() or pcre_exec() as often as necessary. For | passed to pcre_compile() or pcre_exec() as often as necessary. For |
1551 | example, to build and use tables that are appropriate for the French | example, to build and use tables that are appropriate for the French |
1552 | locale (where accented characters with values greater than 127 are | locale (where accented characters with values greater than 127 are |
1553 | treated as letters), the following code could be used: | treated as letters), the following code could be used: |
1554 | ||
1555 | setlocale(LC_CTYPE, "fr_FR"); | setlocale(LC_CTYPE, "fr_FR"); |
1556 | tables = pcre_maketables(); | tables = pcre_maketables(); |
1557 | re = pcre_compile(..., tables); | re = pcre_compile(..., tables); |
1558 | ||
1559 | The locale name "fr_FR" is used on Linux and other Unix-like systems; | The locale name "fr_FR" is used on Linux and other Unix-like systems; |
1560 | if you are using Windows, the name for the French locale is "french". | if you are using Windows, the name for the French locale is "french". |
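The idea of a table "indexed by character value" that depends on the locale can be illustrated with the standard C library alone. The sketch below is not the layout that pcre_maketables() produces, and build_letter_table is a hypothetical name; it only shows how a 256-entry table built from isalpha() captures whatever LC_CTYPE locale is in force when it is built.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Illustrative only; PCRE's real tables have a different layout.
   Fill table[c] with 1 if byte value c is a letter in the current
   LC_CTYPE locale, else 0. */
void build_letter_table(unsigned char table[256])
{
    int c;
    assert(table != NULL);
    for (c = 0; c < 256; c++)
        table[c] = (unsigned char)(isalpha(c) != 0);
}
```

In the "C" locale only the ASCII letters are marked. After a successful setlocale(LC_CTYPE, "fr_FR") on a system where that locale uses a single-byte encoding such as ISO-8859-1, entries for accented letters would typically be marked as well. Once built, the table no longer depends on the current locale, which is why tables built in one locale can be kept and used after the program has switched to another.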
1561 | ||
1562 | When pcre_maketables() runs, the tables are built in memory that is | When pcre_maketables() runs, the tables are built in memory that is |
1563 | obtained via pcre_malloc. It is the caller's responsibility to ensure | obtained via pcre_malloc. It is the caller's responsibility to ensure |
1564 | that the memory containing the tables remains available for as long as | that the memory containing the tables remains available for as long as |
1565 | it is needed. | it is needed. |
1566 | ||
1567 | The pointer that is passed to pcre_compile() is saved with the compiled | The pointer that is passed to pcre_compile() is saved with the compiled |
1568 | pattern, and the same tables are used via this pointer by pcre_study() | pattern, and the same tables are used via this pointer by pcre_study() |
1569 | and normally also by pcre_exec(). Thus, by default, for any single pat- | and normally also by pcre_exec(). Thus, by default, for any single pat- |
1570 | tern, compilation, studying and matching all happen in the same locale, | tern, compilation, studying and matching all happen in the same locale, |
1571 | but different patterns can be compiled in different locales. | but different patterns can be compiled in different locales. |
1572 | ||
1573 | It is possible to pass a table pointer or NULL (indicating the use of | It is possible to pass a table pointer or NULL (indicating the use of |
1574 | the internal tables) to pcre_exec(). Although not intended for this | the internal tables) to pcre_exec(). Although not intended for this |
1575 | purpose, this facility could be used to match a pattern in a different | purpose, this facility could be used to match a pattern in a different |
1576 | locale from the one in which it was compiled. Passing table pointers at | locale from the one in which it was compiled. Passing table pointers at |
1577 | run time is discussed below in the section on matching a pattern. | run time is discussed below in the section on matching a pattern. |
1578 | ||
# | Line 1561 INFORMATION ABOUT A PATTERN | Line 1582 INFORMATION ABOUT A PATTERN |
1582 | int pcre_fullinfo(const pcre *code, const pcre_extra *extra, | int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
1583 | int what, void *where); | int what, void *where); |
1584 | ||
1585 | The pcre_fullinfo() function returns information about a compiled pat- | The pcre_fullinfo() function returns information about a compiled pat- |
1586 | tern. It replaces the obsolete pcre_info() function, which is | tern. It replaces the obsolete pcre_info() function, which is |
1587 | nevertheless retained for backwards compatibility (and is documented below). | nevertheless retained for backwards compatibility (and is documented below). |
1588 | ||
1589 | The first argument for pcre_fullinfo() is a pointer to the compiled | The first argument for pcre_fullinfo() is a pointer to the compiled |
1590 | pattern. The second argument is the result of pcre_study(), or NULL if | pattern. The second argument is the result of pcre_study(), or NULL if |
1591 | the pattern was not studied. The third argument specifies which piece | the pattern was not studied. The third argument specifies which piece |
1592 | of information is required, and the fourth argument is a pointer to a | of information is required, and the fourth argument is a pointer to a |
1593 | variable to receive the data. The yield of the function is zero for | variable to receive the data. The yield of the function is zero for |
1594 | success, or one of the following negative numbers: | success, or one of the following negative numbers: |
1595 | ||
1596 | PCRE_ERROR_NULL the argument code was NULL | PCRE_ERROR_NULL the argument code was NULL |
# | Line 1577 INFORMATION ABOUT A PATTERN | Line 1598 INFORMATION ABOUT A PATTERN |
1598 | PCRE_ERROR_BADMAGIC the "magic number" was not found | PCRE_ERROR_BADMAGIC the "magic number" was not found |
1599 | PCRE_ERROR_BADOPTION the value of what was invalid | PCRE_ERROR_BADOPTION the value of what was invalid |
1600 | ||
1601 | The "magic number" is placed at the start of each compiled pattern as | The "magic number" is placed at the start of each compiled pattern as |
1602 | a simple check against passing an arbitrary memory pointer. Here is a | a simple check against passing an arbitrary memory pointer. Here is a |
1603 | typical call of pcre_fullinfo(), to obtain the length of the compiled | typical call of pcre_fullinfo(), to obtain the length of the compiled |
1604 | pattern: | pattern: |
1605 | ||
1606 | int rc; | int rc; |
# | Line 1590 INFORMATION ABOUT A PATTERN | Line 1611 INFORMATION ABOUT A PATTERN |
1611 | PCRE_INFO_SIZE, /* what is required */ | PCRE_INFO_SIZE, /* what is required */ |
1612 | &length); /* where to put the data */ | &length); /* where to put the data */ |
1613 | ||
1614 | The possible values for the third argument are defined in pcre.h, and | The possible values for the third argument are defined in pcre.h, and |
1615 | are as follows: | are as follows: |
1616 | ||
1617 | PCRE_INFO_BACKREFMAX | PCRE_INFO_BACKREFMAX |
1618 | ||
1619 | Return the number of the highest back reference in the pattern. The | Return the number of the highest back reference in the pattern. The |
1620 | fourth argument should point to an int variable. Zero is returned if | fourth argument should point to an int variable. Zero is returned if |
1621 | there are no back references. | there are no back references. |
1622 | ||
1623 | PCRE_INFO_CAPTURECOUNT | PCRE_INFO_CAPTURECOUNT |
1624 | ||
1625 | Return the number of capturing subpatterns in the pattern. The fourth | Return the number of capturing subpatterns in the pattern. The fourth |
1626 | argument should point to an int variable. | argument should point to an int variable. |
1627 | ||
1628 | PCRE_INFO_DEFAULT_TABLES | PCRE_INFO_DEFAULT_TABLES |
1629 | ||
1630 | Return a pointer to the internal default character tables within PCRE. | Return a pointer to the internal default character tables within PCRE. |
1631 | The fourth argument should point to an unsigned char * variable. This | The fourth argument should point to an unsigned char * variable. This |
1632 | information call is provided for internal use by the pcre_study() func- | information call is provided for internal use by the pcre_study() func- |
1633 | tion. External callers can cause PCRE to use its internal tables by | tion. External callers can cause PCRE to use its internal tables by |
1634 | passing a NULL table pointer. | passing a NULL table pointer. |
1635 | ||
1636 | PCRE_INFO_FIRSTBYTE | PCRE_INFO_FIRSTBYTE |
1637 | ||
1638 | Return information about the first byte of any matched string, for a | Return information about the first byte of any matched string, for a |
1639 | non-anchored pattern. The fourth argument should point to an int vari- | non-anchored pattern. The fourth argument should point to an int vari- |
1640 | able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name | able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
1641 | is still recognized for backwards compatibility.) | is still recognized for backwards compatibility.) |
1642 | ||
1643 | If there is a fixed first byte, for example, from a pattern such as | If there is a fixed first byte, for example, from a pattern such as |
1644 | (cat|cow|coyote), its value is returned. Otherwise, if either | (cat|cow|coyote), its value is returned. Otherwise, if either |
1645 | ||
1646 | (a) the pattern was compiled with the PCRE_MULTILINE option, and every | (a) the pattern was compiled with the PCRE_MULTILINE option, and every |
1647 | branch starts with "^", or | branch starts with "^", or |
1648 | ||
1649 | (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not | (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
1650 | set (if it were set, the pattern would be anchored), | set (if it were set, the pattern would be anchored), |
1651 | ||
1652 | -1 is returned, indicating that the pattern matches only at the start | -1 is returned, indicating that the pattern matches only at the start |
1653 | of a subject string or after any newline within the string. Otherwise | of a subject string or after any newline within the string. Otherwise |
1654 | -2 is returned. For anchored patterns, -2 is returned. | -2 is returned. For anchored patterns, -2 is returned. |
1655 | ||
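The three possible outcomes of PCRE_INFO_FIRSTBYTE described above can be told apart with a small helper. This is only a sketch: classify_first_byte() and the enum names are hypothetical and not part of PCRE, and the function simply mirrors the documented return values (a non-negative value, -1, or -2).

```c
#include <assert.h>

/* Hypothetical classification of the int that pcre_fullinfo() stores
   for PCRE_INFO_FIRSTBYTE; the names are illustrative, not PCRE's. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* non-negative: a fixed first byte is known   */
    FIRST_BYTE_LINE_START,  /* -1: matches only at the start of the subject
                               or after a newline within it                */
    FIRST_BYTE_NONE         /* -2: neither case applies, or the pattern is
                               anchored                                    */
};

static enum first_byte_kind classify_first_byte(int value)
{
    /* The documentation describes no values below -2. */
    assert(value >= -2);
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

In real use, the argument would be the int variable whose address was passed as the fourth argument of pcre_fullinfo().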
1656 | PCRE_INFO_FIRSTTABLE | PCRE_INFO_FIRSTTABLE |
1657 | ||
1658 | If the pattern was studied, and this resulted in the construction of a | If the pattern was studied, and this resulted in the construction of a |
1659 | 256-bit table indicating a fixed set of bytes for the first byte in any | 256-bit table indicating a fixed set of bytes for the first byte in any |
1660 | matching string, a pointer to the table is returned. Otherwise NULL is | matching string, a pointer to the table is returned. Otherwise NULL is |
1661 | returned. The fourth argument should point to an unsigned char * vari- | returned. The fourth argument should point to an unsigned char * vari- |
1662 | able. | able. |
1663 | ||
1664 | PCRE_INFO_HASCRORLF | PCRE_INFO_HASCRORLF |
1665 | ||
1666 | Return 1 if the pattern contains any explicit matches for CR or LF | Return 1 if the pattern contains any explicit matches for CR or LF |
1667 | characters, otherwise 0. The fourth argument should point to an int | characters, otherwise 0. The fourth argument should point to an int |
1668 | variable. An explicit match is either a literal CR or LF character, or | variable. An explicit match is either a literal CR or LF character, or |
1669 | \r or \n. | \r or \n. |
1670 | ||
1671 | PCRE_INFO_JCHANGED | PCRE_INFO_JCHANGED |
1672 | ||
1673 | Return 1 if the (?J) or (?-J) option setting is used in the pattern, | Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
1674 | otherwise 0. The fourth argument should point to an int variable. (?J) | otherwise 0. The fourth argument should point to an int variable. (?J) |
1675 | and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. | and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
1676 | ||
1677 | PCRE_INFO_LASTLITERAL | PCRE_INFO_LASTLITERAL |
1678 | ||
1679 | Return the value of the rightmost literal byte that must exist in any | Return the value of the rightmost literal byte that must exist in any |
1680 | matched string, other than at its start, if such a byte has been | matched string, other than at its start, if such a byte has been |
1681 | recorded. The fourth argument should point to an int variable. If there | recorded. The fourth argument should point to an int variable. If there |
1682 | is no such byte, -1 is returned. For anchored patterns, a last literal | is no such byte, -1 is returned. For anchored patterns, a last literal |
1683 | byte is recorded only if it follows something of variable length. For | byte is recorded only if it follows something of variable length. For |
1684 | example, for the pattern /^a\d+z\d+/ the returned value is "z", but for | example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
1685 | /^a\dz\d/ the returned value is -1. | /^a\dz\d/ the returned value is -1. |
1686 | ||
# | Line 1667 INFORMATION ABOUT A PATTERN | Line 1688 INFORMATION ABOUT A PATTERN |
1688 | PCRE_INFO_NAMEENTRYSIZE | PCRE_INFO_NAMEENTRYSIZE |
1689 | PCRE_INFO_NAMETABLE | PCRE_INFO_NAMETABLE |
1690 | ||
1691 | PCRE supports the use of named as well as numbered capturing parenthe- | PCRE supports the use of named as well as numbered capturing parenthe- |
1692 | ses. The names are just an additional way of identifying the parenthe- | ses. The names are just an additional way of identifying the parenthe- |
1693 | ses, which still acquire numbers. Several convenience functions such as | ses, which still acquire numbers. Several convenience functions such as |
1694 | pcre_get_named_substring() are provided for extracting captured sub- | pcre_get_named_substring() are provided for extracting captured sub- |
1695 | strings by name. It is also possible to extract the data directly, by | strings by name. It is also possible to extract the data directly, by |
1696 | first converting the name to a number in order to access the correct | first converting the name to a number in order to access the correct |
1697 | pointers in the output vector (described with pcre_exec() below). To do | pointers in the output vector (described with pcre_exec() below). To do |
1698 | the conversion, you need to use the name-to-number map, which is | the conversion, you need to use the name-to-number map, which is |
1699 | described by these three values. | described by these three values. |
1700 | ||
1701 | The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT | The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
1702 | gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size | gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
1703 | of each entry; both of these return an int value. The entry size | of each entry; both of these return an int value. The entry size |
1704 | depends on the length of the longest name. PCRE_INFO_NAMETABLE returns | depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
1705 | a pointer to the first entry of the table (a pointer to char). The | a pointer to the first entry of the table (a pointer to char). The |
1706 | first two bytes of each entry are the number of the capturing parenthe- | first two bytes of each entry are the number of the capturing parenthe- |
1707 | sis, most significant byte first. The rest of the entry is the corre- | sis, most significant byte first. The rest of the entry is the corre- |
1708 | sponding name, zero terminated. The names are in alphabetical order. | sponding name, zero terminated. The names are in alphabetical order. |
1709 | When PCRE_DUPNAMES is set, duplicate names are in order of their paren- | When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
1710 | theses numbers. For example, consider the following pattern (assume | theses numbers. For example, consider the following pattern (assume |
1711 | PCRE_EXTENDED is set, so white space - including newlines - is | PCRE_EXTENDED is set, so white space - including newlines - is |
1712 | ignored): | ignored): |
1713 | ||
1714 | (?<date> (?<year>(\d\d)?\d\d) - | (?<date> (?<year>(\d\d)?\d\d) - |
1715 | (?<month>\d\d) - (?<day>\d\d) ) | (?<month>\d\d) - (?<day>\d\d) ) |
1716 | ||
1717 | There are four named subpatterns, so the table has four entries, and | There are four named subpatterns, so the table has four entries, and |
1718 | each entry in the table is eight bytes long. The table is as follows, | each entry in the table is eight bytes long. The table is as follows, |
1719 | with non-printing bytes shown in hexadecimal, and undefined bytes shown | with non-printing bytes shown in hexadecimal, and undefined bytes shown |
1720 | as ??: | as ??: |
1721 | ||
# | Line 1703 INFORMATION ABOUT A PATTERN | Line 1724 INFORMATION ABOUT A PATTERN |
1724 | 00 04 m o n t h 00 | 00 04 m o n t h 00 |
1725 | 00 02 y e a r 00 ?? | 00 02 y e a r 00 ?? |
1726 | ||
1727 | When writing code to extract data from named subpatterns using the | When writing code to extract data from named subpatterns using the |
1728 | name-to-number map, remember that the length of the entries is likely | name-to-number map, remember that the length of the entries is likely |
1729 | to be different for each compiled pattern. | to be different for each compiled pattern. |
1730 | ||
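The table layout described above can be walked by hand, as in the following sketch. It assumes the count, entry size and table pointer have already been obtained through PCRE_INFO_NAMECOUNT, PCRE_INFO_NAMEENTRYSIZE and PCRE_INFO_NAMETABLE. lookup_group_number() and example_table are hypothetical names and not part of PCRE. The sample data reproduces the table for the date pattern: the "date" and "day" rows are derived by counting parentheses in that pattern, and bytes shown as ?? are filled with zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search a name-to-number table of `count` entries, each `entry_size`
   bytes long. Bytes 0-1 of an entry hold the parenthesis number, most
   significant byte first; the zero-terminated name follows.
   Returns the number, or -1 if the name is absent.
   Hypothetical helper, not a PCRE function. */
static int lookup_group_number(const unsigned char *table, int count,
                               int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Four entries of eight bytes each, in alphabetical order of name. */
static const unsigned char example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

Because the entry size is passed in rather than hard-coded, the same helper works for any compiled pattern. With PCRE_DUPNAMES set, this linear search returns the first (lowest-numbered) entry for a duplicated name.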
1731 | PCRE_INFO_OKPARTIAL | PCRE_INFO_OKPARTIAL |
1732 | ||
1733 | Return 1 if the pattern can be used for partial matching, otherwise 0. | Return 1 if the pattern can be used for partial matching with |
1734 | The fourth argument should point to an int variable. The pcrepartial | pcre_exec(), otherwise 0. The fourth argument should point to an int |
1735 | documentation lists the restrictions that apply to patterns when par- | variable. From release 8.00, this always returns 1, because the |
1736 | tial matching is used. | restrictions that previously applied to partial matching have been |
1737 | lifted. The pcrepartial documentation gives details of partial match- | |
1738 | ing. | |
1739 | ||
1740 | PCRE_INFO_OPTIONS | PCRE_INFO_OPTIONS |
1741 | ||
1742 | Return a copy of the options with which the pattern was compiled. The | Return a copy of the options with which the pattern was compiled. The |
1743 | fourth argument should point to an unsigned long int variable. These | fourth argument should point to an unsigned long int variable. These |
1744 | option bits are those specified in the call to pcre_compile(), modified | option bits are those specified in the call to pcre_compile(), modified |
1745 | by any top-level option settings at the start of the pattern itself. In | by any top-level option settings at the start of the pattern itself. In |
1746 | other words, they are the options that will be in force when matching | other words, they are the options that will be in force when matching |
1747 | starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with | starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with |
1748 | the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, | the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, |
1749 | and PCRE_EXTENDED. | and PCRE_EXTENDED. |
1750 | ||
1751 | A pattern is automatically anchored by PCRE if all of its top-level | A pattern is automatically anchored by PCRE if all of its top-level |
1752 | alternatives begin with one of the following: | alternatives begin with one of the following: |
1753 | ||
1754 | ^ unless PCRE_MULTILINE is set | ^ unless PCRE_MULTILINE is set |
# | Line 1739 INFORMATION ABOUT A PATTERN | Line 1762 INFORMATION ABOUT A PATTERN |
1762 | ||
1763 | PCRE_INFO_SIZE | PCRE_INFO_SIZE |
1764 | ||
1765 | Return the size of the compiled pattern, that is, the value that was | Return the size of the compiled pattern, that is, the value that was |
1766 | passed as the argument to pcre_malloc() when PCRE was getting memory in | passed as the argument to pcre_malloc() when PCRE was getting memory in |
1767 | which to place the compiled data. The fourth argument should point to a | which to place the compiled data. The fourth argument should point to a |
1768 | size_t variable. | size_t variable. |
# | Line 1747 INFORMATION ABOUT A PATTERN | Line 1770 INFORMATION ABOUT A PATTERN |
1770 | PCRE_INFO_STUDYSIZE | PCRE_INFO_STUDYSIZE |
1771 | ||
1772 | Return the size of the data block pointed to by the study_data field in | Return the size of the data block pointed to by the study_data field in |
1773 | a pcre_extra block. That is, it is the value that was passed to | a pcre_extra block. That is, it is the value that was passed to |
1774 | pcre_malloc() when PCRE was getting memory into which to place the data | pcre_malloc() when PCRE was getting memory into which to place the data |
1775 | created by pcre_study(). The fourth argument should point to a size_t | created by pcre_study(). The fourth argument should point to a size_t |
1776 | variable. | variable. |
1777 | ||
1778 | ||
# | Line 1757 OBSOLETE INFO FUNCTION | Line 1780 OBSOLETE INFO FUNCTION |
1780 | ||
1781 | int pcre_info(const pcre *code, int *optptr, int *firstcharptr); | int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
1782 | ||
1783 | The pcre_info() function is now obsolete because its interface is too | The pcre_info() function is now obsolete because its interface is too |
1784 | restrictive to return all the available data about a compiled pattern. | restrictive to return all the available data about a compiled pattern. |
1785 | New programs should use pcre_fullinfo() instead. The yield of | New programs should use pcre_fullinfo() instead. The yield of |
1786 | pcre_info() is the number of capturing subpatterns, or one of the fol- | pcre_info() is the number of capturing subpatterns, or one of the fol- |
1787 | lowing negative numbers: | lowing negative numbers: |
1788 | ||
1789 | PCRE_ERROR_NULL the argument code was NULL | PCRE_ERROR_NULL the argument code was NULL |
1790 | PCRE_ERROR_BADMAGIC the "magic number" was not found | PCRE_ERROR_BADMAGIC the "magic number" was not found |
1791 | ||
1792 | If the optptr argument is not NULL, a copy of the options with which | If the optptr argument is not NULL, a copy of the options with which |
1793 | the pattern was compiled is placed in the integer it points to (see | the pattern was compiled is placed in the integer it points to (see |
1794 | PCRE_INFO_OPTIONS above). | PCRE_INFO_OPTIONS above). |
1795 | ||
1796 | If the pattern is not anchored and the firstcharptr argument is not | If the pattern is not anchored and the firstcharptr argument is not |
1797 | NULL, it is used to pass back information about the first character of | NULL, it is used to pass back information about the first character of |
1798 | any matched string (see PCRE_INFO_FIRSTBYTE above). | any matched string (see PCRE_INFO_FIRSTBYTE above). |
1799 | ||
1800 | ||
# | Line 1779 REFERENCE COUNTS | Line 1802 REFERENCE COUNTS |
1802 | ||
1803 | int pcre_refcount(pcre *code, int adjust); | int pcre_refcount(pcre *code, int adjust); |
1804 | ||
1805 | The pcre_refcount() function is used to maintain a reference count in | The pcre_refcount() function is used to maintain a reference count in |
1806 | the data block that contains a compiled pattern. It is provided for the | the data block that contains a compiled pattern. It is provided for the |
1807 | benefit of applications that operate in an object-oriented manner, | benefit of applications that operate in an object-oriented manner, |
1808 | where different parts of the application may be using the same compiled | where different parts of the application may be using the same compiled |
1809 | pattern, but you want to free the block when they are all done. | pattern, but you want to free the block when they are all done. |
1810 | ||
1811 | When a pattern is compiled, the reference count field is initialized to | When a pattern is compiled, the reference count field is initialized to |
1812 | zero. It is changed only by calling this function, whose action is to | zero. It is changed only by calling this function, whose action is to |
1813 | add the adjust value (which may be positive or negative) to it. The | add the adjust value (which may be positive or negative) to it. The |
1814 | yield of the function is the new value. However, the value of the count | yield of the function is the new value. However, the value of the count |
1815 | is constrained to lie between 0 and 65535, inclusive. If the new value | is constrained to lie between 0 and 65535, inclusive. If the new value |
1816 | is outside these limits, it is forced to the appropriate limit value. | is outside these limits, it is forced to the appropriate limit value. |
1817 | ||
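The adjustment rule described above amounts to a saturating addition. The sketch below models only that arithmetic. adjusted_refcount() is a hypothetical name; it is not PCRE's own code and does not touch a compiled pattern.

```c
#include <assert.h>

/* Add `adjust` to `current` and force the result into 0..65535,
   as the text describes for the reference count field.
   Illustrative model only, not part of PCRE. */
static int adjusted_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    if (adjust < 0) {
        /* current >= 0, so the sum cannot overflow downwards. */
        int sum = current + adjust;
        return (sum < 0) ? 0 : sum;
    }
    /* Compare before adding so that the sum cannot overflow upwards. */
    if (adjust > 65535 - current)
        return 65535;
    return current + adjust;
}
```

For example, decrementing a count of 1 by 5 yields 0 rather than a negative value, so a caller that frees the pattern when the count reaches zero can test the returned value directly.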
1818 | Except when it is zero, the reference count is not correctly preserved | Except when it is zero, the reference count is not correctly preserved |
1819 | if a pattern is compiled on one host and then transferred to a host | if a pattern is compiled on one host and then transferred to a host |
1820 | whose byte-order is different. (This seems a highly unlikely scenario.) | whose byte-order is different. (This seems a highly unlikely scenario.) |
1821 | ||
1822 | ||
# | Line 1887 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 1910 MATCHING A PATTERN: THE TRADITIONAL FUNC |
1910 | the total number of calls, because not all calls to match() are recur- | the total number of calls, because not all calls to match() are recur- |
1911 | sive. This limit is of use only if it is set smaller than match_limit. | sive. This limit is of use only if it is set smaller than match_limit. |
1912 | ||
1913 | Limiting the recursion depth limits the amount of stack that can be | Limiting the recursion depth limits the amount of stack that can be |
1914 | used, or, when PCRE has been compiled to use memory on the heap instead | used, or, when PCRE has been compiled to use memory on the heap instead |
1915 | of the stack, the amount of heap memory that can be used. | of the stack, the amount of heap memory that can be used. |
1916 | ||
1917 | The default value for match_limit_recursion can be set when PCRE is | The default value for match_limit_recursion can be set when PCRE is |
1918 | built; the default default is the same value as the default for | built; the default default is the same value as the default for |
1919 | match_limit. You can override the default by supplying pcre_exec() with | match_limit. You can override the default by supplying pcre_exec() with |
1920 | a pcre_extra block in which match_limit_recursion is set, and | a pcre_extra block in which match_limit_recursion is set, and |
1921 | PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the | PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
1922 | limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. | limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
1923 | ||
1924 | The pcre_callout field is used in conjunction with the "callout" fea- | The callout_data field is used in conjunction with the "callout" fea- |
1925 | ture, which is described in the pcrecallout documentation. | ture, and is described in the pcrecallout documentation. |
1926 | ||
1927 | The tables field is used to pass a character tables pointer to | The tables field is used to pass a character tables pointer to |
1928 | pcre_exec(); this overrides the value that is stored with the compiled | pcre_exec(); this overrides the value that is stored with the compiled |
1929 | pattern. A non-NULL value is stored with the compiled pattern only if | pattern. A non-NULL value is stored with the compiled pattern only if |
1930 | custom tables were supplied to pcre_compile() via its tableptr argu- | custom tables were supplied to pcre_compile() via its tableptr argu- |
1931 | ment. If NULL is passed to pcre_exec() using this mechanism, it forces | ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
1932 | PCRE's internal tables to be used. This facility is helpful when re- | PCRE's internal tables to be used. This facility is helpful when re- |
1933 | using patterns that have been saved after compiling with an external | using patterns that have been saved after compiling with an external |
1934 | set of tables, because the external tables might be at a different | set of tables, because the external tables might be at a different |
1935 | address when pcre_exec() is called. See the pcreprecompile documenta- | address when pcre_exec() is called. See the pcreprecompile documenta- |
1936 | tion for a discussion of saving compiled patterns for later use. | tion for a discussion of saving compiled patterns for later use. |
1937 | ||
1938 | Option bits for pcre_exec() | Option bits for pcre_exec() |
1939 | ||
1940 | The unused bits of the options argument for pcre_exec() must be zero. | The unused bits of the options argument for pcre_exec() must be zero. |
1941 | The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, | The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
1942 | PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and | PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
1943 | PCRE_PARTIAL. | PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and |
1944 | PCRE_PARTIAL_HARD. | |
1945 | ||
1946 | PCRE_ANCHORED | PCRE_ANCHORED |
1947 | ||
# | Line 1997 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2021 MATCHING A PATTERN: THE TRADITIONAL FUNC |
2021 | ||
2022 | a?b? | a?b? |
2023 | ||
2024 | is applied to a string not beginning with "a" or "b", it matches the | is applied to a string not beginning with "a" or "b", it matches an |
2025 | empty string at the start of the subject. With PCRE_NOTEMPTY set, this | empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
2026 | match is not valid, so PCRE searches further into the string for occur- | match is not valid, so PCRE searches further into the string for occur- |
2027 | rences of "a" or "b". | rences of "a" or "b". |
2028 | ||
2029 | Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- | PCRE_NOTEMPTY_ATSTART |
2030 | cial case of a pattern match of the empty string within its split() | |
2031 | function, and when using the /g modifier. It is possible to emulate | This is like PCRE_NOTEMPTY, except that an empty string match that is |
2032 | Perl's behaviour after matching a null string by first trying the match | not at the start of the subject is permitted. If the pattern is |
2033 | again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then | anchored, such a match can occur only if the pattern contains \K. |
2034 | if that fails by advancing the starting offset (see below) and trying | |
2035 | an ordinary match again. There is some code that demonstrates how to do | Perl has no direct equivalent of PCRE_NOTEMPTY or |
2036 | this in the pcredemo.c sample program. | PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern |
2037 | match of the empty string within its split() function, and when using | |
2038 | the /g modifier. It is possible to emulate Perl's behaviour after | |
2039 | matching a null string by first trying the match again at the same off- | |
2040 | set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that | |
2041 | fails, by advancing the starting offset (see below) and trying an ordi- | |
2042 | nary match again. There is some code that demonstrates how to do this | |
2043 | in the pcredemo sample program. | |
2044 | ||
2045 | PCRE_NO_START_OPTIMIZE | |
2046 | ||
2047 | There are a number of optimizations that pcre_exec() uses at the start | |
2048 | of a match, in order to speed up the process. For example, if it is | |
2049 | known that a match must start with a specific character, it searches | |
2050 | the subject for that character, and fails immediately if it cannot find | |
2051 | it, without actually running the main matching function. When callouts | |
2052 | are in use, these optimizations can cause them to be skipped. This | |
2053 | option disables the "start-up" optimizations, causing performance to | |
2054 | suffer, but ensuring that the callouts do occur. | |
2055 | ||
2056 | PCRE_NO_UTF8_CHECK | PCRE_NO_UTF8_CHECK |
2057 | ||
# | Line 2033 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2075 MATCHING A PATTERN: THE TRADITIONAL FUNC |
2075 | value of startoffset that does not point to the start of a UTF-8 char- | value of startoffset that does not point to the start of a UTF-8 char- |
2076 | acter, is undefined. Your program may crash. | acter, is undefined. Your program may crash. |
2077 | ||
2078 | PCRE_PARTIAL | PCRE_PARTIAL_HARD |
2079 | PCRE_PARTIAL_SOFT | |
2080 | ||
2081 | This option turns on the partial matching feature. If the subject | These options turn on the partial matching feature. For backwards com- |
2082 | string fails to match the pattern, but at some point during the match- | patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial |
2083 | ing process the end of the subject was reached (that is, the subject | match occurs if the end of the subject string is reached successfully, |
2084 | partially matches the pattern and the failure to match occurred only | but there are not enough subject characters to complete the match. If |
2085 | because there were not enough subject characters), pcre_exec() returns | this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately |
2086 | PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is | returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, |
2087 | used, there are restrictions on what may appear in the pattern. These | matching continues by testing any other alternatives. Only if they all |
2088 | are discussed in the pcrepartial documentation. | fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH). |
2089 | The portion of the string that was inspected when the partial match was | |
2090 | found is set as the first matching string. There is a more detailed | |
2091 | discussion in the pcrepartial documentation. | |
2092 | ||
2093 | The string to be matched by pcre_exec() | The string to be matched by pcre_exec() |
2094 | ||
2095 | The subject string is passed to pcre_exec() as a pointer in subject, a | The subject string is passed to pcre_exec() as a pointer in subject, a |
2096 | length in length, and a starting byte offset in startoffset. In UTF-8 | length (in bytes) in length, and a starting byte offset in startoffset. |
2097 | mode, the byte offset must point to the start of a UTF-8 character. | In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
2098 | Unlike the pattern string, the subject may contain binary zero bytes. | acter. Unlike the pattern string, the subject may contain binary zero |
2099 | When the starting offset is zero, the search for a match starts at the | bytes. When the starting offset is zero, the search for a match starts |
2100 | beginning of the subject, and this is by far the most common case. | at the beginning of the subject, and this is by far the most common |
2101 | case. | |
2102 | ||
2103 | A non-zero starting offset is useful when searching for another match | A non-zero starting offset is useful when searching for another match |
2104 | in the same subject by calling pcre_exec() again after a previous suc- | in the same subject by calling pcre_exec() again after a previous suc- |
# | Line 2087 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2134 MATCHING A PATTERN: THE TRADITIONAL FUNC |
2134 | string. PCRE supports several other kinds of parenthesized subpattern | string. PCRE supports several other kinds of parenthesized subpattern |
2135 | that do not cause substrings to be captured. | that do not cause substrings to be captured. |
2136 | ||
2137 | Captured substrings are returned to the caller via a vector of integer | Captured substrings are returned to the caller via a vector of integers |
2138 | offsets whose address is passed in ovector. The number of elements in | whose address is passed in ovector. The number of elements in the vec- |
2139 | the vector is passed in ovecsize, which must be a non-negative number. | tor is passed in ovecsize, which must be a non-negative number. Note: |
2140 | Note: this argument is NOT the size of ovector in bytes. | this argument is NOT the size of ovector in bytes. |
2141 | ||
2142 | The first two-thirds of the vector is used to pass back captured sub- | The first two-thirds of the vector is used to pass back captured sub- |
2143 | strings, each substring using a pair of integers. The remaining third | strings, each substring using a pair of integers. The remaining third |
2144 | of the vector is used as workspace by pcre_exec() while matching cap- | of the vector is used as workspace by pcre_exec() while matching cap- |
2145 | turing subpatterns, and is not available for passing back information. | turing subpatterns, and is not available for passing back information. |
2146 | The length passed in ovecsize should always be a multiple of three. If | The number passed in ovecsize should always be a multiple of three. If |
2147 | it is not, it is rounded down. | it is not, it is rounded down. |
2148 | ||
2149 | When a match is successful, information about captured substrings is | When a match is successful, information about captured substrings is |
2150 | returned in pairs of integers, starting at the beginning of ovector, | returned in pairs of integers, starting at the beginning of ovector, |
2151 | and continuing up to two-thirds of its length at the most. The first | and continuing up to two-thirds of its length at the most. The first |
2152 | element of a pair is set to the offset of the first character in a sub- | element of each pair is set to the byte offset of the first character |
2153 | string, and the second is set to the offset of the first character | in a substring, and the second is set to the byte offset of the first |
2154 | after the end of a substring. The first pair, ovector[0] and ovec- | character after the end of a substring. Note: these values are always |
2155 | tor[1], identify the portion of the subject string matched by the | byte offsets, even in UTF-8 mode. They are not character counts. |
2156 | entire pattern. The next pair is used for the first capturing subpat- | |
2157 | tern, and so on. The value returned by pcre_exec() is one more than the | The first pair of integers, ovector[0] and ovector[1], identify the |
2158 | highest numbered pair that has been set. For example, if two substrings | portion of the subject string matched by the entire pattern. The next |
2159 | have been captured, the returned value is 3. If there are no capturing | pair is used for the first capturing subpattern, and so on. The value |
2160 | subpatterns, the return value from a successful match is 1, indicating | returned by pcre_exec() is one more than the highest numbered pair that |
2161 | that just the first pair of offsets has been set. | has been set. For example, if two substrings have been captured, the |
2162 | returned value is 3. If there are no capturing subpatterns, the return | |
2163 | value from a successful match is 1, indicating that just the first pair | |
2164 | of offsets has been set. | |
2165 | ||
2166 | If a capturing subpattern is matched repeatedly, it is the last portion | If a capturing subpattern is matched repeatedly, it is the last portion |
2167 | of the string that it matched that is returned. | of the string that it matched that is returned. |
2168 | ||
2169 | If the vector is too small to hold all the captured substring offsets, | If the vector is too small to hold all the captured substring offsets, |
2170 | it is used as far as possible (up to two-thirds of its length), and the | it is used as far as possible (up to two-thirds of its length), and the |
2171 | function returns a value of zero. In particular, if the substring off- | function returns a value of zero. If the substring offsets are not of |
2172 | sets are not of interest, pcre_exec() may be called with ovector passed | interest, pcre_exec() may be called with ovector passed as NULL and |
2173 | as NULL and ovecsize as zero. However, if the pattern contains back | ovecsize as zero. However, if the pattern contains back references and |
2174 | references and the ovector is not big enough to remember the related | the ovector is not big enough to remember the related substrings, PCRE |
2175 | substrings, PCRE has to get additional memory for use during matching. | has to get additional memory for use during matching. Thus it is usu- |
2176 | Thus it is usually advisable to supply an ovector. | ally advisable to supply an ovector. |
2177 | ||
2178 | The pcre_info() function can be used to find out how many capturing | The pcre_info() function can be used to find out how many capturing |
2179 | subpatterns there are in a compiled pattern. The smallest size for | subpatterns there are in a compiled pattern. The smallest size for |
2180 | ovector that will allow for n captured substrings, in addition to the | ovector that will allow for n captured substrings, in addition to the |
2181 | offsets of the substring matched by the whole pattern, is (n+1)*3. | offsets of the substring matched by the whole pattern, is (n+1)*3. |
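
The sizing rule above can be sketched in plain C. This is an illustrative sketch, not part of PCRE: the helper names min_ovecsize() and usable_pairs() are hypothetical, and only the (n+1)*3 formula and the round-down-to-a-multiple-of-three behaviour come from the text.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): smallest ovecsize that has
   room for the whole-match pair plus n_captures captured substrings.
   Two-thirds of the vector hold offset pairs; the last third is workspace. */
int min_ovecsize(int n_captures)
{
    assert(n_captures >= 0);
    return (n_captures + 1) * 3;
}

/* Hypothetical helper: number of offset pairs that can be passed back for
   a given ovecsize. Integer division mirrors the documented rounding down
   of ovecsize to a multiple of three. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns, min_ovecsize(2) gives 9. An ovecsize of 10 yields the same three usable pairs as 9.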
2182 | ||
2183 | It is possible for capturing subpattern number n+1 to match some part | It is possible for capturing subpattern number n+1 to match some part |
2184 | of the subject when subpattern n has not been used at all. For example, | of the subject when subpattern n has not been used at all. For example, |
2185 | if the string "abc" is matched against the pattern (a|(z))(bc) the | if the string "abc" is matched against the pattern (a|(z))(bc) the |
2186 | return from the function is 4, and subpatterns 1 and 3 are matched, but | return from the function is 4, and subpatterns 1 and 3 are matched, but |
2187 | 2 is not. When this happens, both values in the offset pairs corre- | 2 is not. When this happens, both values in the offset pairs corre- |
2188 | sponding to unused subpatterns are set to -1. | sponding to unused subpatterns are set to -1. |
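
As a rough illustration of reading the offset pairs directly, the sketch below works on an ovector the caller has already obtained. The helpers capture_is_set() and copy_capture() are hypothetical names, not PCRE functions. For "abc" matched against (a|(z))(bc), the text above implies the pairs {0,3}, {0,1}, {-1,-1}, {1,3}.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if capture n was set, 0 if both of its offsets
   are -1, which is how unused subpatterns are reported. */
int capture_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Hypothetical helper: copy capture n into buf as a zero-terminated string.
   Offsets are byte offsets, so the length is a byte count.
   Returns the length, or -1 if the capture is unset or buf is too small. */
int copy_capture(const char *subject, const int *ovector, int n,
                 char *buf, size_t bufsize)
{
    int start, len;

    if (!capture_is_set(ovector, n))
        return -1;
    start = ovector[2 * n];
    len = ovector[2 * n + 1] - start;
    if ((size_t)len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Checking for -1 before using a pair avoids treating an unset subpattern as a zero-length match.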
2189 | ||
2190 | Offset values that correspond to unused subpatterns at the end of the | Offset values that correspond to unused subpatterns at the end of the |
2191 | expression are also set to -1. For example, if the string "abc" is | expression are also set to -1. For example, if the string "abc" is |
2192 | matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not | matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
2193 | matched. The return from the function is 2, because the highest used | matched. The return from the function is 2, because the highest used |
2194 | capturing subpattern number is 1. However, you can refer to the offsets | capturing subpattern number is 1. However, you can refer to the offsets |
2195 | for the second and third capturing subpatterns if you wish (assuming | for the second and third capturing subpatterns if you wish (assuming |
2196 | the vector is large enough, of course). | the vector is large enough, of course). |
2197 | ||
2198 | Some convenience functions are provided for extracting the captured | Some convenience functions are provided for extracting the captured |
2199 | substrings as separate strings. These are described below. | substrings as separate strings. These are described below. |
2200 | ||
2201 | Error return values from pcre_exec() | Error return values from pcre_exec() |
2202 | ||
2203 | If pcre_exec() fails, it returns a negative number. The following are | If pcre_exec() fails, it returns a negative number. The following are |
2204 | defined in the header file: | defined in the header file: |
2205 | ||
2206 | PCRE_ERROR_NOMATCH (-1) | PCRE_ERROR_NOMATCH (-1) |
# | Line 2159 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2209 MATCHING A PATTERN: THE TRADITIONAL FUNC |
2209 | ||
2210 | PCRE_ERROR_NULL (-2) | PCRE_ERROR_NULL (-2) |
2211 | ||
2212 | Either code or subject was passed as NULL, or ovector was NULL and | Either code or subject was passed as NULL, or ovector was NULL and |
2213 | ovecsize was not zero. | ovecsize was not zero. |
2214 | ||
2215 | PCRE_ERROR_BADOPTION (-3) | PCRE_ERROR_BADOPTION (-3) |
# | Line 2168 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2218 MATCHING A PATTERN: THE TRADITIONAL FUNC |
2218 | ||
2219 | PCRE_ERROR_BADMAGIC (-4) | PCRE_ERROR_BADMAGIC (-4) |
2220 | ||
2221 | PCRE stores a 4-byte "magic number" at the start of the compiled code, | PCRE stores a 4-byte "magic number" at the start of the compiled code, |
2222 | to catch the case when it is passed a junk pointer and to detect when a | to catch the case when it is passed a junk pointer and to detect when a |
2223 | pattern that was compiled in an environment of one endianness is run in | pattern that was compiled in an environment of one endianness is run in |
2224 | an environment with the other endianness. This is the error that PCRE | an environment with the other endianness. This is the error that PCRE |
2225 | gives when the magic number is not present. | gives when the magic number is not present. |
2226 | ||
2227 | PCRE_ERROR_UNKNOWN_OPCODE (-5) | PCRE_ERROR_UNKNOWN_OPCODE (-5) |
2228 | ||
2229 | While running the pattern match, an unknown item was encountered in the | While running the pattern match, an unknown item was encountered in the |
2230 | compiled pattern. This error could be caused by a bug in PCRE or by | compiled pattern. This error could be caused by a bug in PCRE or by |
2231 | overwriting of the compiled pattern. | overwriting of the compiled pattern. |
2232 | ||
2233 | PCRE_ERROR_NOMEMORY (-6) | PCRE_ERROR_NOMEMORY (-6) |
2234 | ||
2235 | If a pattern contains back references, but the ovector that is passed | If a pattern contains back references, but the ovector that is passed |
2236 | to pcre_exec() is not big enough to remember the referenced substrings, | to pcre_exec() is not big enough to remember the referenced substrings, |
2237 | PCRE gets a block of memory at the start of matching to use for this | PCRE gets a block of memory at the start of matching to use for this |
2238 | purpose. If the call via pcre_malloc() fails, this error is given. The | purpose. If the call via pcre_malloc() fails, this error is given. The |
2239 | memory is automatically freed at the end of matching. | memory is automatically freed at the end of matching. |
2240 | ||
2241 | PCRE_ERROR_NOSUBSTRING (-7) | PCRE_ERROR_NOSUBSTRING (-7) |
2242 | ||
2243 | This error is used by the pcre_copy_substring(), pcre_get_substring(), | This error is used by the pcre_copy_substring(), pcre_get_substring(), |
2244 | and pcre_get_substring_list() functions (see below). It is never | and pcre_get_substring_list() functions (see below). It is never |
2245 | returned by pcre_exec(). | returned by pcre_exec(). |
2246 | ||
2247 | PCRE_ERROR_MATCHLIMIT (-8) | PCRE_ERROR_MATCHLIMIT (-8) |
2248 | ||
2249 | The backtracking limit, as specified by the match_limit field in a | The backtracking limit, as specified by the match_limit field in a |
2250 | pcre_extra structure (or defaulted) was reached. See the description | pcre_extra structure (or defaulted) was reached. See the description |
2251 | above. | above. |
2252 | ||
2253 | PCRE_ERROR_CALLOUT (-9) | PCRE_ERROR_CALLOUT (-9) |
2254 | ||
2255 | This error is never generated by pcre_exec() itself. It is provided for | This error is never generated by pcre_exec() itself. It is provided for |
2256 | use by callout functions that want to yield a distinctive error code. | use by callout functions that want to yield a distinctive error code. |
2257 | See the pcrecallout documentation for details. | See the pcrecallout documentation for details. |
2258 | ||
2259 | PCRE_ERROR_BADUTF8 (-10) | PCRE_ERROR_BADUTF8 (-10) |
2260 | ||
2261 | A string that contains an invalid UTF-8 byte sequence was passed as a | A string that contains an invalid UTF-8 byte sequence was passed as a |
2262 | subject. | subject. |
2263 | ||
2264 | PCRE_ERROR_BADUTF8_OFFSET (-11) | PCRE_ERROR_BADUTF8_OFFSET (-11) |
2265 | ||
2266 | The UTF-8 byte sequence that was passed as a subject was valid, but the | The UTF-8 byte sequence that was passed as a subject was valid, but the |
2267 | value of startoffset did not point to the beginning of a UTF-8 charac- | value of startoffset did not point to the beginning of a UTF-8 charac- |
2268 | ter. | ter. |
2269 | ||
2270 | PCRE_ERROR_PARTIAL (-12) | PCRE_ERROR_PARTIAL (-12) |
2271 | ||
2272 | The subject string did not match, but it did match partially. See the | The subject string did not match, but it did match partially. See the |
2273 | pcrepartial documentation for details of partial matching. | pcrepartial documentation for details of partial matching. |
2274 | ||
2275 | PCRE_ERROR_BADPARTIAL (-13) | PCRE_ERROR_BADPARTIAL (-13) |
2276 | ||
2277 | The PCRE_PARTIAL option was used with a compiled pattern containing | This code is no longer in use. It was formerly returned when the |
2278 | items that are not supported for partial matching. See the pcrepartial | PCRE_PARTIAL option was used with a compiled pattern containing items |
2279 | documentation for details of partial matching. | that were not supported for partial matching. From release 8.00 |
2280 | onwards, there are no restrictions on partial matching. | |
2281 | ||
2282 | PCRE_ERROR_INTERNAL (-14) | PCRE_ERROR_INTERNAL (-14) |
2283 | ||
# | Line 2235 MATCHING A PATTERN: THE TRADITIONAL FUNC | Line 2286 MATCHING A PATTERN: THE TRADITIONAL FUNC |
2286 | ||
2287 | PCRE_ERROR_BADCOUNT (-15) | PCRE_ERROR_BADCOUNT (-15) |
2288 | ||
2289 | This error is given if the value of the ovecsize argument is negative. | This error is given if the value of the ovecsize argument is negative. |
2290 | ||
2291 | PCRE_ERROR_RECURSIONLIMIT (-21) | PCRE_ERROR_RECURSIONLIMIT (-21) |
2292 | ||
2293 | The internal recursion limit, as specified by the match_limit_recursion | The internal recursion limit, as specified by the match_limit_recursion |
2294 | field in a pcre_extra structure (or defaulted) was reached. See the | field in a pcre_extra structure (or defaulted) was reached. See the |
2295 | description above. | description above. |
2296 | ||
2297 | PCRE_ERROR_BADNEWLINE (-23) | PCRE_ERROR_BADNEWLINE (-23) |
# | Line 2263 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER | Line 2314 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
2314 | int pcre_get_substring_list(const char *subject, | int pcre_get_substring_list(const char *subject, |
2315 | int *ovector, int stringcount, const char ***listptr); | int *ovector, int stringcount, const char ***listptr); |
2316 | ||
2317 | Captured substrings can be accessed directly by using the offsets | Captured substrings can be accessed directly by using the offsets |
2318 | returned by pcre_exec() in ovector. For convenience, the functions | returned by pcre_exec() in ovector. For convenience, the functions |
2319 | pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- | pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
2320 | string_list() are provided for extracting captured substrings as new, | string_list() are provided for extracting captured substrings as new, |
2321 | separate, zero-terminated strings. These functions identify substrings | separate, zero-terminated strings. These functions identify substrings |
2322 | by number. The next section describes functions for extracting named | by number. The next section describes functions for extracting named |
2323 | substrings. | substrings. |
2324 | ||
2325 | A substring that contains a binary zero is correctly extracted and has | A substring that contains a binary zero is correctly extracted and has |
2326 | a further zero added on the end, but the result is not, of course, a C | a further zero added on the end, but the result is not, of course, a C |
2327 | string. However, you can process such a string by referring to the | string. However, you can process such a string by referring to the |
2328 | length that is returned by pcre_copy_substring() and pcre_get_sub- | length that is returned by pcre_copy_substring() and pcre_get_sub- |
2329 | string(). Unfortunately, the interface to pcre_get_substring_list() is | string(). Unfortunately, the interface to pcre_get_substring_list() is |
2330 | not adequate for handling strings containing binary zeros, because the | not adequate for handling strings containing binary zeros, because the |
2331 | end of the final string is not independently indicated. | end of the final string is not independently indicated. |
2332 | ||
2333 | The first three arguments are the same for all three of these func- | The first three arguments are the same for all three of these func- |
2334 | tions: subject is the subject string that has just been successfully | tions: subject is the subject string that has just been successfully |
2335 | matched, ovector is a pointer to the vector of integer offsets that was | matched, ovector is a pointer to the vector of integer offsets that was |
2336 | passed to pcre_exec(), and stringcount is the number of substrings that | passed to pcre_exec(), and stringcount is the number of substrings that |
2337 | were captured by the match, including the substring that matched the | were captured by the match, including the substring that matched the |
2338 | entire regular expression. This is the value returned by pcre_exec() if | entire regular expression. This is the value returned by pcre_exec() if |
2339 | it is greater than zero. If pcre_exec() returned zero, indicating that | it is greater than zero. If pcre_exec() returned zero, indicating that |
2340 | it ran out of space in ovector, the value passed as stringcount should | it ran out of space in ovector, the value passed as stringcount should |
2341 | be the number of elements in the vector divided by three. | be the number of elements in the vector divided by three. |
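
The choice of stringcount described above can be sketched as a small helper. The name substring_count() is hypothetical. Passing a negative return code through unchanged is an assumption made for this sketch, since the text covers only the positive and zero cases.

```c
/* Hypothetical helper: derive the stringcount argument for the substring
   extraction functions from the value returned by pcre_exec().
   rc > 0  : use rc itself.
   rc == 0 : the vector was too small, so use the number of pairs it holds.
   rc < 0  : matching failed; returned unchanged so the caller can detect it. */
int substring_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;
    if (rc == 0)
        return ovecsize / 3;
    return rc;
}
```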
2342 | ||
2343 | The functions pcre_copy_substring() and pcre_get_substring() extract a | The functions pcre_copy_substring() and pcre_get_substring() extract a |
2344 | single substring, whose number is given as stringnumber. A value of | single substring, whose number is given as stringnumber. A value of |
2345 | zero extracts the substring that matched the entire pattern, whereas | zero extracts the substring that matched the entire pattern, whereas |
2346 | higher values extract the captured substrings. For pcre_copy_sub- | higher values extract the captured substrings. For pcre_copy_sub- |
2347 | string(), the string is placed in buffer, whose length is given by | string(), the string is placed in buffer, whose length is given by |
2348 | buffersize, while for pcre_get_substring() a new block of memory is | buffersize, while for pcre_get_substring() a new block of memory is |
2349 | obtained via pcre_malloc, and its address is returned via stringptr. | obtained via pcre_malloc, and its address is returned via stringptr. |
2350 | The yield of the function is the length of the string, not including | The yield of the function is the length of the string, not including |
2351 | the terminating zero, or one of these error codes: | the terminating zero, or one of these error codes: |
2352 | ||
2353 | PCRE_ERROR_NOMEMORY (-6) | PCRE_ERROR_NOMEMORY (-6) |
2354 | ||
2355 | The buffer was too small for pcre_copy_substring(), or the attempt to | The buffer was too small for pcre_copy_substring(), or the attempt to |
2356 | get memory failed for pcre_get_substring(). | get memory failed for pcre_get_substring(). |
2357 | ||
2358 | PCRE_ERROR_NOSUBSTRING (-7) | PCRE_ERROR_NOSUBSTRING (-7) |
2359 | ||
2360 | There is no substring whose number is stringnumber. | There is no substring whose number is stringnumber. |
2361 | ||
2362 | The pcre_get_substring_list() function extracts all available sub- | The pcre_get_substring_list() function extracts all available sub- |
2363 | strings and builds a list of pointers to them. All this is done in a | strings and builds a list of pointers to them. All this is done in a |
2364 | single block of memory that is obtained via pcre_malloc. The address of | single block of memory that is obtained via pcre_malloc. The address of |
2365 | the memory block is returned via listptr, which is also the start of | the memory block is returned via listptr, which is also the start of |
2366 | the list of string pointers. The end of the list is marked by a NULL | the list of string pointers. The end of the list is marked by a NULL |
2367 | pointer. The yield of the function is zero if all went well, or the | pointer. The yield of the function is zero if all went well, or the |
2368 | error code | error code |
2369 | ||
2370 | PCRE_ERROR_NOMEMORY (-6) | PCRE_ERROR_NOMEMORY (-6) |
2371 | ||
2372 | if the attempt to get the memory block failed. | if the attempt to get the memory block failed. |
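
A list of this shape can be walked up to its NULL terminator. The sketch below is illustrative only. count_list() is a hypothetical helper that takes any NULL-terminated array of string pointers and does not itself call PCRE. As noted earlier, strings that contain binary zeros cannot be measured reliably through this interface.

```c
#include <stddef.h>

/* Hypothetical helper: count the entries in a NULL-terminated list of
   string pointers, the layout described for the list returned via listptr. */
int count_list(const char **list)
{
    int n = 0;

    while (list[n] != NULL)
        n++;
    return n;
}
```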
2373 | ||
2374 | When any of these functions encounter a substring that is unset, which | When any of these functions encounter a substring that is unset, which |
2375 | can happen when capturing subpattern number n+1 matches some part of | can happen when capturing subpattern number n+1 matches some part of |
2376 | the subject, but subpattern n has not been used at all, they return an | the subject, but subpattern n has not been used at all, they return an |
2377 | empty string. This can be distinguished from a genuine zero-length sub- | empty string. This can be distinguished from a genuine zero-length sub- |
2378 | string by inspecting the appropriate offset in ovector, which is nega- | string by inspecting the appropriate offset in ovector, which is nega- |
2379 | tive for unset substrings. | tive for unset substrings. |
2380 | ||
2381 | The two convenience functions pcre_free_substring() and pcre_free_sub- | The two convenience functions pcre_free_substring() and pcre_free_sub- |
2382 | string_list() can be used to free the memory returned by a previous | string_list() can be used to free the memory returned by a previous |
2383 | call of pcre_get_substring() or pcre_get_substring_list(), respec- | call of pcre_get_substring() or pcre_get_substring_list(), respec- |
2384 | tively. They do nothing more than call the function pointed to by | tively. They do nothing more than call the function pointed to by |
2385 | pcre_free, which of course could be called directly from a C program. | pcre_free, which of course could be called directly from a C program. |
2386 | However, PCRE is used in some situations where it is linked via a spe- | However, PCRE is used in some situations where it is linked via a spe- |
2387 | cial interface to another programming language that cannot use | cial interface to another programming language that cannot use |
2388 | pcre_free directly; it is for these cases that the functions are pro- | pcre_free directly; it is for these cases that the functions are pro- |
2389 | vided. | vided. |
2390 | ||
2391 | ||
# | Line 2353 EXTRACTING CAPTURED SUBSTRINGS BY NAME | Line 2404 EXTRACTING CAPTURED SUBSTRINGS BY NAME |
2404 | int stringcount, const char *stringname, | int stringcount, const char *stringname, |
2405 | const char **stringptr); | const char **stringptr); |
2406 | ||
2407 | To extract a substring by name, you first have to find associated num- | To extract a substring by name, you first have to find associated num- |
2408 | ber. For example, for this pattern | ber. For example, for this pattern |
2409 | ||
2410 | (a+)b(?<xxx>\d+)... | (a+)b(?<xxx>\d+)... |
# | Line 2362 EXTRACTING CAPTURED SUBSTRINGS BY NAME | Line 2413 EXTRACTING CAPTURED SUBSTRINGS BY NAME |
2413 | be unique (PCRE_DUPNAMES was not set), you can find the number from the | be unique (PCRE_DUPNAMES was not set), you can find the number from the |
2414 | name by calling pcre_get_stringnumber(). The first argument is the com- | name by calling pcre_get_stringnumber(). The first argument is the com- |
2415 | piled pattern, and the second is the name. The yield of the function is | piled pattern, and the second is the name. The yield of the function is |
2416 | the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no | the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
2417 | subpattern of that name. | subpattern of that name. |
2418 | ||
2419 | Given the number, you can extract the substring directly, or use one of | Given the number, you can extract the substring directly, or use one of |
2420 | the functions described in the previous section. For convenience, there | the functions described in the previous section. For convenience, there |
2421 | are also two functions that do the whole job. | are also two functions that do the whole job. |
2422 | ||
2423 | Most of the arguments of pcre_copy_named_substring() and | Most of the arguments of pcre_copy_named_substring() and |
2424 | pcre_get_named_substring() are the same as those for the similarly | pcre_get_named_substring() are the same as those for the similarly |
2425 | named functions that extract by number. As these are described in the | named functions that extract by number. As these are described in the |
2426 | previous section, they are not re-described here. There are just two | previous section, they are not re-described here. There are just two |
2427 | differences: | differences: |
2428 | ||
2429 | First, instead of a substring number, a substring name is given. Sec- | First, instead of a substring number, a substring name is given. Sec- |
2430 | ond, there is an extra argument, given at the start, which is a pointer | ond, there is an extra argument, given at the start, which is a pointer |
2431 | to the compiled pattern. This is needed in order to gain access to the | to the compiled pattern. This is needed in order to gain access to the |
2432 | name-to-number translation table. | name-to-number translation table. |
2433 | ||
2434 | These functions call pcre_get_stringnumber(), and if it succeeds, they | These functions call pcre_get_stringnumber(), and if it succeeds, they |
2435 | then call pcre_copy_substring() or pcre_get_substring(), as appropri- | then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
2436 | ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the | ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
2437 | behaviour may not be what you want (see the next section). | behaviour may not be what you want (see the next section). |
2438 | ||
2439 | Warning: If the pattern uses the "(?|" feature to set up multiple sub- | |
2440 | patterns with the same number, you cannot use names to distinguish | |
2441 | them, because names are not included in the compiled code. The matching | |
2442 | process uses only numbers. | |
2443 | ||
2444 | ||
2445 | DUPLICATE SUBPATTERN NAMES | DUPLICATE SUBPATTERN NAMES |
2446 | ||
# | Line 2448 MATCHING A PATTERN: THE ALTERNATIVE FUNC | Line 2504 MATCHING A PATTERN: THE ALTERNATIVE FUNC |
2504 | characteristics to the normal algorithm, and is not compatible with | characteristics to the normal algorithm, and is not compatible with |
2505 | Perl. Some of the features of PCRE patterns are not supported. Never- | Perl. Some of the features of PCRE patterns are not supported. Never- |
2506 | theless, there are times when this kind of matching can be useful. For | theless, there are times when this kind of matching can be useful. For |
2507 | a discussion of the two matching algorithms, see the pcrematching docu- | a discussion of the two matching algorithms, and a list of features |
2508 | mentation. | that pcre_dfa_exec() does not support, see the pcrematching documenta- |
2509 | tion. | |
2510 | ||
2511 | The arguments for the pcre_dfa_exec() function are the same as for | The arguments for the pcre_dfa_exec() function are the same as for |
2512 | pcre_exec(), plus two extras. The ovector argument is used in a differ- | pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2513 | ent way, and this is described below. The other common arguments are | ent way, and this is described below. The other common arguments are |
2514 | used in the same way as for pcre_exec(), so their description is not | used in the same way as for pcre_exec(), so their description is not |
2515 | repeated here. | repeated here. |
2516 | ||
2517 | The two additional arguments provide workspace for the function. The | The two additional arguments provide workspace for the function. The |
2518 | workspace vector should contain at least 20 elements. It is used for | workspace vector should contain at least 20 elements. It is used for |
2519 | keeping track of multiple paths through the pattern tree. More | keeping track of multiple paths through the pattern tree. More |
2520 | workspace will be needed for patterns and subjects where there are a | workspace will be needed for patterns and subjects where there are a |
2521 | lot of potential matches. | lot of potential matches. |
2522 | ||
2523 | Here is an example of a simple call to pcre_dfa_exec(): | Here is an example of a simple call to pcre_dfa_exec(): |
# | Line 2482 MATCHING A PATTERN: THE ALTERNATIVE FUNC | Line 2539 MATCHING A PATTERN: THE ALTERNATIVE FUNC |
2539 | ||
2540 | Option bits for pcre_dfa_exec() | Option bits for pcre_dfa_exec() |
2541 | ||
2542 | The unused bits of the options argument for pcre_dfa_exec() must be | The unused bits of the options argument for pcre_dfa_exec() must be |
2543 | zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- | zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2544 | LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, | LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, |
2545 | PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last | PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR- |
2546 | three of these are the same as for pcre_exec(), so their description is | TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2547 | not repeated here. | four of these are exactly the same as for pcre_exec(), so their |
2548 | description is not repeated here. | |
2549 | PCRE_PARTIAL | |
2550 | PCRE_PARTIAL_HARD | |
2551 | This has the same general effect as it does for pcre_exec(), but the | PCRE_PARTIAL_SOFT |
2552 | details are slightly different. When PCRE_PARTIAL is set for | |
2553 | pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into | These have the same general effect as they do for pcre_exec(), but the |
2554 | PCRE_ERROR_PARTIAL if the end of the subject is reached, there have | details are slightly different. When PCRE_PARTIAL_HARD is set for |
2555 | been no complete matches, but there is still at least one matching pos- | pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub- |
2556 | sibility. The portion of the string that provided the partial match is | ject is reached and there is still at least one matching possibility |
2557 | set as the first matching string. | that requires additional characters. This happens even if some complete |
2558 | matches have also been found. When PCRE_PARTIAL_SOFT is set, the return | |
2559 | code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end | |
2560 | of the subject is reached, there have been no complete matches, but | |
2561 | there is still at least one matching possibility. The portion of the | |
2562 | string that was inspected when the longest partial match was found is | |
2563 | set as the first matching string in both cases. | |
2564 | ||
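The difference between the two partial-matching options in the revised text can be sketched as a small decision function. This is only a model of the rule described above, not PCRE code: the `partial_mode` enum and the `dfa_partial_result()` helper are hypothetical names introduced for illustration, and the two error values are copied from pcre.h.

```c
#include <assert.h>

/* Values as defined in pcre.h; repeated here so the sketch stands alone. */
#define PCRE_ERROR_NOMATCH  (-1)
#define PCRE_ERROR_PARTIAL (-12)

/* Hypothetical mode flag for this sketch; these are not PCRE option bits. */
enum partial_mode { PARTIAL_OFF, PARTIAL_SOFT, PARTIAL_HARD };

/*
 * Model of the result chosen once the end of the subject is reached.
 *   complete: number of complete matches found (>= 0)
 *   pending:  nonzero if at least one matching possibility still
 *             requires additional characters
 */
static int dfa_partial_result(int complete, int pending,
                              enum partial_mode mode)
{
    assert(complete >= 0);
    if (pending && mode == PARTIAL_HARD)
        return PCRE_ERROR_PARTIAL;   /* even if complete matches exist */
    if (complete > 0)
        return complete;             /* success: number of matches */
    if (pending && mode == PARTIAL_SOFT)
        return PCRE_ERROR_PARTIAL;   /* converted from NOMATCH */
    return PCRE_ERROR_NOMATCH;
}
```

With two complete matches and a pending possibility, the hard mode yields PCRE_ERROR_PARTIAL while the soft mode yields 2.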
2565 | PCRE_DFA_SHORTEST | PCRE_DFA_SHORTEST |
2566 | ||
# | Line 2508 MATCHING A PATTERN: THE ALTERNATIVE FUNC | Line 2571 MATCHING A PATTERN: THE ALTERNATIVE FUNC |
2571 | ||
2572 | PCRE_DFA_RESTART | PCRE_DFA_RESTART |
2573 | ||
2574 | When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and | When pcre_dfa_exec() returns a partial match, it is possible to call it |
2575 | returns a partial match, it is possible to call it again, with addi- | again, with additional subject characters, and have it continue with |
2576 | tional subject characters, and have it continue with the same match. | the same match. The PCRE_DFA_RESTART option requests this action; when |
2577 | The PCRE_DFA_RESTART option requests this action; when it is set, the | it is set, the workspace and wscount options must reference the same |
2578 | workspace and wscount options must reference the same vector as before | vector as before because data about the match so far is left in them |
2579 | because data about the match so far is left in them after a partial | after a partial match. There is more discussion of this facility in the |
2580 | match. There is more discussion of this facility in the pcrepartial | pcrepartial documentation. |
documentation. | ||
2581 | ||
2582 | Successful returns from pcre_dfa_exec() | Successful returns from pcre_dfa_exec() |
2583 | ||
2584 | When pcre_dfa_exec() succeeds, it may have matched more than one sub- | When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2585 | string in the subject. Note, however, that all the matches from one run | string in the subject. Note, however, that all the matches from one run |
2586 | of the function start at the same point in the subject. The shorter | of the function start at the same point in the subject. The shorter |
2587 | matches are all initial substrings of the longer matches. For example, | matches are all initial substrings of the longer matches. For example, |
2588 | if the pattern | if the pattern |
2589 | ||
2590 | <.*> | <.*> |
# | Line 2537 MATCHING A PATTERN: THE ALTERNATIVE FUNC | Line 2599 MATCHING A PATTERN: THE ALTERNATIVE FUNC |
2599 | <something> <something else> | <something> <something else> |
2600 | <something> <something else> <something further> | <something> <something else> <something further> |
2601 | ||
2602 | On success, the yield of the function is a number greater than zero, | On success, the yield of the function is a number greater than zero, |
2603 | which is the number of matched substrings. The substrings themselves | which is the number of matched substrings. The substrings themselves |
2604 | are returned in ovector. Each string uses two elements; the first is | are returned in ovector. Each string uses two elements; the first is |
2605 | the offset to the start, and the second is the offset to the end. In | the offset to the start, and the second is the offset to the end. In |
2606 | fact, all the strings have the same start offset. (Space could have | fact, all the strings have the same start offset. (Space could have |
2607 | been saved by giving this only once, but it was decided to retain some | been saved by giving this only once, but it was decided to retain some |
2608 | compatibility with the way pcre_exec() returns data, even though the | compatibility with the way pcre_exec() returns data, even though the |
2609 | meaning of the strings is different.) | meaning of the strings is different.) |
2610 | ||
2611 | The strings are returned in reverse order of length; that is, the long- | The strings are returned in reverse order of length; that is, the long- |
2612 | est matching string is given first. If there were too many matches to | est matching string is given first. If there were too many matches to |
2613 | fit into ovector, the yield of the function is zero, and the vector is | fit into ovector, the yield of the function is zero, and the vector is |
2614 | filled with the longest matches. | filled with the longest matches. |
2615 | ||
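As a rough illustration of the ovector layout just described, the following sketch walks the offset pairs after a successful call. The helper name `dfa_print_matches()` is hypothetical, and the handling of a zero yield assumes that every pair of elements in the vector then holds a match, which is how the text above reads but is not spelled out there.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Print the substrings reported by a successful pcre_dfa_exec() call.
 *   rc:       the yield of the function (> 0: number of matches,
 *             0: there were more matches than ovector could hold)
 *   ovector:  start/end offset pairs, longest match first
 *   ovecsize: number of int elements in ovector
 * Returns the number of matches printed.
 */
static int dfa_print_matches(const char *subject, int rc,
                             const int *ovector, int ovecsize)
{
    int count, i;

    assert(rc >= 0);
    /* Assumption: on a zero yield the whole vector is filled with pairs. */
    count = (rc == 0) ? ovecsize / 2 : rc;

    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];      /* the same for every match */
        int end = ovector[2 * i + 1];
        printf("%d: %.*s\n", i, end - start, subject + start);
    }
    return count;
}
```

For the subject "<a> <b> <c>" and the pattern <.*>, the vector would hold the pairs (0,11), (0,7) and (0,3), in that order.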
2616 | Error returns from pcre_dfa_exec() | Error returns from pcre_dfa_exec() |
2617 | ||
2618 | The pcre_dfa_exec() function returns a negative number when it fails. | The pcre_dfa_exec() function returns a negative number when it fails. |
2619 | Many of the errors are the same as for pcre_exec(), and these are | Many of the errors are the same as for pcre_exec(), and these are |
2620 | described above. There are in addition the following errors that are | described above. There are in addition the following errors that are |
2621 | specific to pcre_dfa_exec(): | specific to pcre_dfa_exec(): |
2622 | ||
2623 | PCRE_ERROR_DFA_UITEM (-16) | PCRE_ERROR_DFA_UITEM (-16) |
2624 | ||
2625 | This return is given if pcre_dfa_exec() encounters an item in the pat- | This return is given if pcre_dfa_exec() encounters an item in the pat- |
2626 | tern that it does not support, for instance, the use of \C or a back | tern that it does not support, for instance, the use of \C or a back |
2627 | reference. | reference. |
2628 | ||
2629 | PCRE_ERROR_DFA_UCOND (-17) | PCRE_ERROR_DFA_UCOND (-17) |
2630 | ||
2631 | This return is given if pcre_dfa_exec() encounters a condition item | This return is given if pcre_dfa_exec() encounters a condition item |
2632 | that uses a back reference for the condition, or a test for recursion | that uses a back reference for the condition, or a test for recursion |
2633 | in a specific group. These are not supported. | in a specific group. These are not supported. |
2634 | ||
2635 | PCRE_ERROR_DFA_UMLIMIT (-18) | PCRE_ERROR_DFA_UMLIMIT (-18) |
2636 | ||
2637 | This return is given if pcre_dfa_exec() is called with an extra block | This return is given if pcre_dfa_exec() is called with an extra block |
2638 | that contains a setting of the match_limit field. This is not supported | that contains a setting of the match_limit field. This is not supported |
2639 | (it is meaningless). | (it is meaningless). |
2640 | ||
2641 | PCRE_ERROR_DFA_WSSIZE (-19) | PCRE_ERROR_DFA_WSSIZE (-19) |
2642 | ||
2643 | This return is given if pcre_dfa_exec() runs out of space in the | This return is given if pcre_dfa_exec() runs out of space in the |
2644 | workspace vector. | workspace vector. |
2645 | ||
2646 | PCRE_ERROR_DFA_RECURSE (-20) | PCRE_ERROR_DFA_RECURSE (-20) |
2647 | ||
2648 | When a recursive subpattern is processed, the matching function calls | When a recursive subpattern is processed, the matching function calls |
2649 | itself recursively, using private vectors for ovector and workspace. | itself recursively, using private vectors for ovector and workspace. |
2650 | This error is given if the output vector is not large enough. This | This error is given if the output vector is not large enough. This |
2651 | should be extremely rare, as a vector of size 1000 is used. | should be extremely rare, as a vector of size 1000 is used. |
2652 | ||
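A caller might sort the five DFA-specific codes listed above into broad groups before deciding how to react. The grouping below is an interpretation for illustration only, and `dfa_error_kind()` and its enum are hypothetical names; the numeric values are the ones given in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grouping of the DFA-specific errors; not part of PCRE. */
enum dfa_error_kind {
    DFA_ERR_OTHER,     /* not one of the DFA-specific codes */
    DFA_ERR_PATTERN,   /* the pattern uses an unsupported item */
    DFA_ERR_CALLER,    /* the call itself was set up incorrectly */
    DFA_ERR_SPACE      /* a vector was too small */
};

/*
 * Classify a negative return from pcre_dfa_exec(). If name is not NULL,
 * *name receives the symbolic name, or NULL for other codes.
 */
static enum dfa_error_kind dfa_error_kind(int rc, const char **name)
{
    const char *n = NULL;
    enum dfa_error_kind kind = DFA_ERR_OTHER;

    assert(rc < 0);
    switch (rc) {
    case -16: n = "PCRE_ERROR_DFA_UITEM";   kind = DFA_ERR_PATTERN; break;
    case -17: n = "PCRE_ERROR_DFA_UCOND";   kind = DFA_ERR_PATTERN; break;
    case -18: n = "PCRE_ERROR_DFA_UMLIMIT"; kind = DFA_ERR_CALLER;  break;
    case -19: n = "PCRE_ERROR_DFA_WSSIZE";  kind = DFA_ERR_SPACE;   break;
    case -20: n = "PCRE_ERROR_DFA_RECURSE"; kind = DFA_ERR_SPACE;   break;
    default:  break;
    }
    if (name != NULL)
        *name = n;
    return kind;
}
```

Only PCRE_ERROR_DFA_WSSIZE can be addressed by retrying with a larger workspace; the vector involved in PCRE_ERROR_DFA_RECURSE is private to the library.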
2653 | ||
2654 | SEE ALSO | SEE ALSO |
2655 | ||
2656 | pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- | pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
2657 | tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). | tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2658 | ||
2659 | ||
2660 | AUTHOR | AUTHOR |
# | Line 2604 AUTHOR | Line 2666 AUTHOR |
2666 | ||
2667 | REVISION | REVISION |
2668 | ||
2669 | Last updated: 12 April 2008 | Last updated: 11 September 2009 |
2670 | Copyright (c) 1997-2008 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
2671 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
2672 | ||
2673 | ||
2674 | PCRECALLOUT(3) PCRECALLOUT(3) | PCRECALLOUT(3) PCRECALLOUT(3) |
2675 | ||
2676 | ||
# | Line 2656 PCRE CALLOUTS | Line 2718 PCRE CALLOUTS |
2718 | MISSING CALLOUTS | MISSING CALLOUTS |
2719 | ||
2720 | You should be aware that, because of optimizations in the way PCRE | You should be aware that, because of optimizations in the way PCRE |
2721 | matches patterns, callouts sometimes do not happen. For example, if the | matches patterns by default, callouts sometimes do not happen. For |
2722 | pattern is | example, if the pattern is |
2723 | ||
2724 | ab(?C4)cd | ab(?C4)cd |
2725 | ||
# | Line 2666 MISSING CALLOUTS | Line 2728 MISSING CALLOUTS |
2728 | ever start, and the callout is never reached. However, with "abyd", | ever start, and the callout is never reached. However, with "abyd", |
2729 | though the result is still no match, the callout is obeyed. | though the result is still no match, the callout is obeyed. |
2730 | ||
2731 | You can disable these optimizations by passing the PCRE_NO_START_OPTI- | |
2732 | MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the | |
2733 | matching process, but does ensure that callouts such as the example | |
2734 | above are obeyed. | |
2735 | ||
2736 | ||
2737 | THE CALLOUT INTERFACE | THE CALLOUT INTERFACE |
2738 | ||
2739 | During matching, when PCRE reaches a callout point, the external func- | During matching, when PCRE reaches a callout point, the external func- |
2740 | tion defined by pcre_callout is called (if it is set). This applies to | tion defined by pcre_callout is called (if it is set). This applies to |
2741 | both the pcre_exec() and the pcre_dfa_exec() matching functions. The | both the pcre_exec() and the pcre_dfa_exec() matching functions. The |
2742 | only argument to the callout function is a pointer to a pcre_callout | only argument to the callout function is a pointer to a pcre_callout |
2743 | block. This structure contains the following fields: | block. This structure contains the following fields: |
2744 | ||
2745 | int version; | int version; |
# | Line 2688 THE CALLOUT INTERFACE | Line 2755 THE CALLOUT INTERFACE |
2755 | int pattern_position; | int pattern_position; |
2756 | int next_item_length; | int next_item_length; |
2757 | ||
2758 | The version field is an integer containing the version number of the | The version field is an integer containing the version number of the |
2759 | block format. The initial version was 0; the current version is 1. The | block format. The initial version was 0; the current version is 1. The |
2760 | version number will change again in future if additional fields are | version number will change again in future if additional fields are |
2761 | added, but the intention is never to remove any of the existing fields. | added, but the intention is never to remove any of the existing fields. |
2762 | ||
2763 | The callout_number field contains the number of the callout, as com- | The callout_number field contains the number of the callout, as com- |
# | Line 2775 AUTHOR | Line 2842 AUTHOR |
2842 | ||
2843 | REVISION | REVISION |
2844 | ||
2845 | Last updated: 29 May 2007 | Last updated: 15 March 2009 |
2846 | Copyright (c) 1997-2007 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
2847 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
2848 | ||
2849 | ||
2850 | PCRECOMPAT(3) PCRECOMPAT(3) | PCRECOMPAT(3) PCRECOMPAT(3) |
2851 | ||
2852 | ||
# | Line 2792 DIFFERENCES BETWEEN PCRE AND PERL | Line 2859 DIFFERENCES BETWEEN PCRE AND PERL |
2859 | This document describes the differences in the ways that PCRE and Perl | This document describes the differences in the ways that PCRE and Perl |
2860 | handle regular expressions. The differences described here are mainly | handle regular expressions. The differences described here are mainly |
2861 | with respect to Perl 5.8, though PCRE versions 7.0 and later contain | with respect to Perl 5.8, though PCRE versions 7.0 and later contain |
2862 | some features that are expected to be in the forthcoming Perl 5.10. | some features that are in Perl 5.10. |
2863 | ||
2864 | 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details | 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
2865 | of what it does have are given in the section on UTF-8 support in the | of what it does have are given in the section on UTF-8 support in the |
# | Line 2824 DIFFERENCES BETWEEN PCRE AND PERL | Line 2891 DIFFERENCES BETWEEN PCRE AND PERL |
2891 | is built with Unicode character property support. The properties that | is built with Unicode character property support. The properties that |
2892 | can be tested with \p and \P are limited to the general category prop- | can be tested with \p and \P are limited to the general category prop- |
2893 | erties such as Lu and Nd, script names such as Greek or Han, and the | erties such as Lu and Nd, script names such as Greek or Han, and the |
2894 | derived properties Any and L&. | derived properties Any and L&. PCRE does support the Cs (surrogate) |
2895 | property, which Perl does not; the Perl documentation says "Because | |
2896 | Perl hides the need for the user to understand the internal representa- | |
2897 | tion of Unicode characters, there is no need to implement the somewhat | |
2898 | messy concept of surrogates." | |
2899 | ||
2900 | 7. PCRE does support the \Q...\E escape for quoting substrings. Charac- | 7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
2901 | ters in between are treated as literals. This is slightly different | ters in between are treated as literals. This is slightly different |
# | Line 2844 DIFFERENCES BETWEEN PCRE AND PERL | Line 2915 DIFFERENCES BETWEEN PCRE AND PERL |
2915 | ||
2916 | 8. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) | 8. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) |
2917 | constructions. However, there is support for recursive patterns. This | constructions. However, there is support for recursive patterns. This |
2918 | is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE | is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE |
2919 | "callout" feature allows an external function to be called during pat- | "callout" feature allows an external function to be called during pat- |
2920 | tern matching. See the pcrecallout documentation for details. | tern matching. See the pcrecallout documentation for details. |
2921 | ||
2922 | 9. Subpatterns that are called recursively or as "subroutines" are | 9. Subpatterns that are called recursively or as "subroutines" are |
2923 | always treated as atomic groups in PCRE. This is like Python, but | always treated as atomic groups in PCRE. This is like Python, but |
2924 | unlike Perl. | unlike Perl. There is a discussion of an example that explains this in |
2925 | more detail in the section on recursion differences from Perl in the | |
2926 | pcrepattern page. | |
2927 | ||
2928 | 10. There are some differences that are concerned with the settings of | 10. There are some differences that are concerned with the settings of |
2929 | captured strings when part of a pattern is repeated. For example, | captured strings when part of a pattern is repeated. For example, |
# | Line 2859 DIFFERENCES BETWEEN PCRE AND PERL | Line 2932 DIFFERENCES BETWEEN PCRE AND PERL |
2932 | ||
2933 | 11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), | 11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), |
2934 | (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in | (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in |
2935 | the forms without an argument. PCRE does not support (*MARK). If | the forms without an argument. PCRE does not support (*MARK). |
(*ACCEPT) is within capturing parentheses, PCRE does not set that cap- | ||
ture group; this is different to Perl. | ||
2936 | ||
2937 | 12. PCRE provides some extensions to the Perl regular expression facil- | 12. PCRE provides some extensions to the Perl regular expression facil- |
2938 | ities. Perl 5.10 will include new features that are not in earlier | ities. Perl 5.10 will include new features that are not in earlier |
# | Line 2886 DIFFERENCES BETWEEN PCRE AND PERL | Line 2957 DIFFERENCES BETWEEN PCRE AND PERL |
2957 | (e) PCRE_ANCHORED can be used at matching time to force a pattern to be | (e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
2958 | tried only at the first matching position in the subject string. | tried only at the first matching position in the subject string. |
2959 | ||
2960 | (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- | (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
2961 | TURE options for pcre_exec() have no Perl equivalents. | and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva- |
2962 | lents. | |
2963 | ||
2964 | (g) The \R escape sequence can be restricted to match only CR, LF, or | (g) The \R escape sequence can be restricted to match only CR, LF, or |
2965 | CRLF by the PCRE_BSR_ANYCRLF option. | CRLF by the PCRE_BSR_ANYCRLF option. |
2966 | ||
2967 | (h) The callout facility is PCRE-specific. | (h) The callout facility is PCRE-specific. |
# | Line 2899 DIFFERENCES BETWEEN PCRE AND PERL | Line 2971 DIFFERENCES BETWEEN PCRE AND PERL |
2971 | (j) Patterns compiled by PCRE can be saved and re-used at a later time, | (j) Patterns compiled by PCRE can be saved and re-used at a later time, |
2972 | even on different hosts that have the other endianness. | even on different hosts that have the other endianness. |
2973 | ||
2974 | (k) The alternative matching function (pcre_dfa_exec()) matches in a | (k) The alternative matching function (pcre_dfa_exec()) matches in a |
2975 | different way and is not Perl-compatible. | different way and is not Perl-compatible. |
2976 | ||
2977 | (l) PCRE recognizes some special sequences such as (*CR) at the start | (l) PCRE recognizes some special sequences such as (*CR) at the start |
2978 | of a pattern that set overall options that cannot be changed within the | of a pattern that set overall options that cannot be changed within the |
2979 | pattern. | pattern. |
2980 | ||
# | Line 2916 AUTHOR | Line 2988 AUTHOR |
2988 | ||
2989 | REVISION | REVISION |
2990 | ||
2991 | Last updated: 11 September 2007 | Last updated: 18 September 2009 |
2992 | Copyright (c) 1997-2007 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
2993 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
2994 | ||
2995 | ||
2996 | PCREPATTERN(3) PCREPATTERN(3) | PCREPATTERN(3) PCREPATTERN(3) |
2997 | ||
2998 | ||
# | Line 2948 PCRE REGULAR EXPRESSION DETAILS | Line 3020 PCRE REGULAR EXPRESSION DETAILS |
3020 | The original operation of PCRE was on strings of one-byte characters. | The original operation of PCRE was on strings of one-byte characters. |
3021 | However, there is now also support for UTF-8 character strings. To use | However, there is now also support for UTF-8 character strings. To use |
3022 | this, you must build PCRE to include UTF-8 support, and then call | this, you must build PCRE to include UTF-8 support, and then call |
3023 | pcre_compile() with the PCRE_UTF8 option. How this affects pattern | pcre_compile() with the PCRE_UTF8 option. There is also a special |
3024 | matching is mentioned in several places below. There is also a summary | sequence that can be given at the start of a pattern: |
3025 | of UTF-8 features in the section on UTF-8 support in the main pcre | |
3026 | page. | (*UTF8) |
3027 | ||
3028 | Starting a pattern with this sequence is equivalent to setting the | |
3029 | PCRE_UTF8 option. This feature is not Perl-compatible. How setting | |
3030 | UTF-8 mode affects pattern matching is mentioned in several places | |
3031 | below. There is also a summary of UTF-8 features in the section on | |
3032 | UTF-8 support in the main pcre page. | |
3033 | ||
3034 | The remainder of this document discusses the patterns that are sup- | The remainder of this document discusses the patterns that are sup- |
3035 | ported by PCRE when its main matching function, pcre_exec(), is used. | ported by PCRE when its main matching function, pcre_exec(), is used. |
# | Line 3055 CHARACTERS AND METACHARACTERS | Line 3133 CHARACTERS AND METACHARACTERS |
3133 | syntax) | syntax) |
3134 | ] terminates the character class | ] terminates the character class |
3135 | ||
3136 | The following sections describe the use of each of the metacharacters. | The following sections describe the use of each of the metacharacters. |
3137 | ||
3138 | ||
3139 | BACKSLASH | BACKSLASH |
3140 | ||
3141 | The backslash character has several uses. Firstly, if it is followed by | The backslash character has several uses. Firstly, if it is followed by |
3142 | a non-alphanumeric character, it takes away any special meaning that | a non-alphanumeric character, it takes away any special meaning that |
3143 | character may have. This use of backslash as an escape character | character may have. This use of backslash as an escape character |
3144 | applies both inside and outside character classes. | applies both inside and outside character classes. |
3145 | ||
3146 | For example, if you want to match a * character, you write \* in the | For example, if you want to match a * character, you write \* in the |
3147 | pattern. This escaping action applies whether or not the following | pattern. This escaping action applies whether or not the following |
3148 | character would otherwise be interpreted as a metacharacter, so it is | character would otherwise be interpreted as a metacharacter, so it is |
3149 | always safe to precede a non-alphanumeric with backslash to specify | always safe to precede a non-alphanumeric with backslash to specify |
3150 | that it stands for itself. In particular, if you want to match a back- | that it stands for itself. In particular, if you want to match a back- |
3151 | slash, you write \\. | slash, you write \\. |
3152 | ||
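Because it is always safe to put a backslash before a non-alphanumeric character, a literal string can be turned into a pattern mechanically. The helper below is a minimal sketch under stated assumptions: `quote_literal()` is a hypothetical name, the input is a NUL-terminated string, and bytes of 0x80 and above are copied unchanged so that multi-byte UTF-8 characters are not split.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Copy src to dst, preceding every ASCII non-alphanumeric character
 * with a backslash. dstsize is the capacity of dst including the
 * terminating NUL. Returns the length written, or (size_t)-1 if dst
 * is too small.
 */
static size_t quote_literal(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        size_t need = (c < 0x80 && !isalnum(c)) ? 2 : 1;

        if (n + need + 1 > dstsize)      /* keep room for the NUL */
            return (size_t)-1;
        if (need == 2)
            dst[n++] = '\\';
        dst[n++] = *src;
    }
    if (n + 1 > dstsize)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

Escaping whitespace and # as well means the result keeps its meaning when the pattern is compiled with PCRE_EXTENDED.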
3153 | If a pattern is compiled with the PCRE_EXTENDED option, whitespace in | If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
3154 | the pattern (other than in a character class) and characters between a | the pattern (other than in a character class) and characters between a |
3155 | # outside a character class and the next newline are ignored. An escap- | # outside a character class and the next newline are ignored. An escap- |
3156 | ing backslash can be used to include a whitespace or # character as | ing backslash can be used to include a whitespace or # character as |
3157 | part of the pattern. | part of the pattern. |
3158 | ||
3159 | If you want to remove the special meaning from a sequence of charac- | If you want to remove the special meaning from a sequence of charac- |
3160 | ters, you can do so by putting them between \Q and \E. This is differ- | ters, you can do so by putting them between \Q and \E. This is differ- |
3161 | ent from Perl in that $ and @ are handled as literals in \Q...\E | ent from Perl in that $ and @ are handled as literals in \Q...\E |
3162 | sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- | sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
3163 | tion. Note the following examples: | tion. Note the following examples: |
3164 | ||
3165 | Pattern PCRE matches Perl matches | Pattern PCRE matches Perl matches |
# | Line 3091 BACKSLASH | Line 3169 BACKSLASH |
3169 | \Qabc\$xyz\E abc\$xyz abc\$xyz | \Qabc\$xyz\E abc\$xyz abc\$xyz |
3170 | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz | \Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
3171 | ||
3172 | The \Q...\E sequence is recognized both inside and outside character | The \Q...\E sequence is recognized both inside and outside character |
3173 | classes. | classes. |
3174 | ||
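The PCRE column of the table above can be reproduced by rewriting each \Q...\E run as individually escaped characters. The function below is a sketch of that rewriting, not the code PCRE uses; `expand_quoted()` is a hypothetical name, and a \Q without a closing \E is assumed to quote to the end of the pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Rewrite pat so that every \Q...\E run becomes backslash-escaped
 * literals. Text outside such runs is copied unchanged. Returns the
 * output length, or (size_t)-1 if out is too small (the space check
 * is conservative: it wants room for two characters plus the NUL).
 */
static size_t expand_quoted(const char *pat, char *out, size_t outsize)
{
    size_t n = 0;
    int quoting = 0;

    assert(pat != NULL && out != NULL);
    while (*pat != '\0') {
        if (n + 3 > outsize)
            return (size_t)-1;
        if (quoting) {
            unsigned char c = (unsigned char)*pat;
            if (pat[0] == '\\' && pat[1] == 'E') {
                quoting = 0;                 /* end of the quoted run */
                pat += 2;
                continue;
            }
            if (c < 0x80 && !isalnum(c))
                out[n++] = '\\';             /* make the character literal */
            out[n++] = *pat++;
        } else if (pat[0] == '\\' && pat[1] == 'Q') {
            quoting = 1;                     /* start of a quoted run */
            pat += 2;
        } else if (pat[0] == '\\' && pat[1] != '\0') {
            out[n++] = *pat++;               /* copy other escapes as a pair */
            out[n++] = *pat++;
        } else {
            out[n++] = *pat++;
        }
    }
    if (n + 1 > outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Applied to \Qabc\$xyz\E this produces abc\\\$xyz, which matches the subject text abc\$xyz as the table shows.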
3175 | Non-printing characters | Non-printing characters |
3176 | ||
3177 | A second use of backslash provides a way of encoding non-printing char- | A second use of backslash provides a way of encoding non-printing char- |
3178 | acters in patterns in a visible manner. There is no restriction on the | acters in patterns in a visible manner. There is no restriction on the |
3179 | appearance of non-printing characters, apart from the binary zero that | appearance of non-printing characters, apart from the binary zero that |
3180 | terminates a pattern, but when a pattern is being prepared by text | terminates a pattern, but when a pattern is being prepared by text |
3181 | editing, it is usually easier to use one of the following escape | editing, it is usually easier to use one of the following escape |
3182 | sequences than the binary character it represents: | sequences than the binary character it represents: |
3183 | ||
3184 | \a alarm, that is, the BEL character (hex 07) | \a alarm, that is, the BEL character (hex 07) |
# | Line 3114 BACKSLASH | Line 3192 BACKSLASH |
3192 | \xhh character with hex code hh | \xhh character with hex code hh |
3193 | \x{hhh..} character with hex code hhh.. | \x{hhh..} character with hex code hhh.. |
3194 | ||
3195 | The precise effect of \cx is as follows: if x is a lower case letter, | The precise effect of \cx is as follows: if x is a lower case letter, |
3196 | it is converted to upper case. Then bit 6 of the character (hex 40) is | it is converted to upper case. Then bit 6 of the character (hex 40) is |
3197 | inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; | inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
3198 | becomes hex 7B. | becomes hex 7B. |
3199 | ||
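The \cx rule is two steps of arithmetic, which the following sketch reproduces. `control_escape()` is a hypothetical helper, and it assumes an ASCII execution character set.

```c
#include <assert.h>

/*
 * Return the character code that \cx stands for, following the rule
 * above: lower case letters are converted to upper case, then bit 6
 * (hex 40) is inverted. Assumes ASCII.
 */
static int control_escape(int x)
{
    assert(x >= 0 && x < 128);
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';      /* convert to upper case */
    return x ^ 0x40;         /* invert bit 6 */
}
```

The three cases in the text correspond to control_escape('z'), control_escape('{') and control_escape(';').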
3200 | After \x, from zero to two hexadecimal digits are read (letters can be | After \x, from zero to two hexadecimal digits are read (letters can be |
3201 | in upper or lower case). Any number of hexadecimal digits may appear | in upper or lower case). Any number of hexadecimal digits may appear |
3202 | between \x{ and }, but the value of the character code must be less | between \x{ and }, but the value of the character code must be less |
3203 | than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, | than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
3204 | the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger | the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
3205 | than the largest Unicode code point, which is 10FFFF. | than the largest Unicode code point, which is 10FFFF. |
3206 | ||
3207 | If characters other than hexadecimal digits appear between \x{ and }, | If characters other than hexadecimal digits appear between \x{ and }, |
3208 | or if there is no terminating }, this form of escape is not recognized. | or if there is no terminating }, this form of escape is not recognized. |
3209 | Instead, the initial \x will be interpreted as a basic hexadecimal | Instead, the initial \x will be interpreted as a basic hexadecimal |
3210 | escape, with no following digits, giving a character whose value is | escape, with no following digits, giving a character whose value is |
3211 | zero. | zero. |
3212 | ||
3213 | Characters whose value is less than 256 can be defined by either of the | Characters whose value is less than 256 can be defined by either of the |
3214 | two syntaxes for \x. There is no difference in the way they are han- | two syntaxes for \x. There is no difference in the way they are han- |
3215 | dled. For example, \xdc is exactly the same as \x{dc}. | dled. For example, \xdc is exactly the same as \x{dc}. |
3216 | ||
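The rules for \x given above can be sketched as a small parser. This is an illustration of the documented behaviour rather than PCRE's own code: `parse_x_escape()` and `hex_digit()` are hypothetical names, ASCII is assumed, and reporting an over-large value as -1 is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hexadecimal digit, or -1 if c is not one. */
static int hex_digit(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * p points just after "\x". On success the character code is stored in
 * *value and the number of pattern characters consumed after "\x" is
 * returned. Returns -1, leaving *value untouched, if a \x{...} value
 * exceeds the limit for the mode (FF, or 7FFFFFFF in UTF-8 mode).
 */
static int parse_x_escape(const char *p, int utf8, unsigned long *value)
{
    unsigned long v = 0;
    unsigned long limit = utf8 ? 0x7FFFFFFFul : 0xFFul;
    int i, d;

    assert(p != NULL && value != NULL);
    if (p[0] == '{') {
        int end = 1;
        while (hex_digit((unsigned char)p[end]) >= 0)
            end++;
        if (p[end] == '}') {
            for (i = 1; i < end; i++) {
                d = hex_digit((unsigned char)p[i]);
                if (v > (limit - (unsigned long)d) / 16)
                    return -1;           /* code too large for this mode */
                v = v * 16 + (unsigned long)d;
            }
            *value = v;
            return end + 1;
        }
        /* Not a valid \x{...}: basic escape with no digits, value zero. */
        *value = 0;
        return 0;
    }
    for (i = 0; i < 2 && (d = hex_digit((unsigned char)p[i])) >= 0; i++)
        v = v * 16 + (unsigned long)d;   /* zero to two digits */
    *value = v;
    return i;
}
```

Both "dc" and "{dc}" yield the code DC, in line with the statement that the two syntaxes are handled identically for values below 256.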
3217 | After \0 up to two further octal digits are read. If there are fewer | After \0 up to two further octal digits are read. If there are fewer |
3218 | than two digits, just those that are present are used. Thus the | than two digits, just those that are present are used. Thus the |
3219 | sequence \0\x\07 specifies two binary zeros followed by a BEL character | sequence \0\x\07 specifies two binary zeros followed by a BEL character |
3220 | (code value 7). Make sure you supply two digits after the initial zero | (code value 7). Make sure you supply two digits after the initial zero |
3221 | if the pattern character that follows is itself an octal digit. | if the pattern character that follows is itself an octal digit. |
3222 | ||
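The advice about always supplying two digits after the initial zero follows from how many digits are read. A minimal sketch, using the hypothetical helper name `parse_zero_escape()` and assuming ASCII:

```c
#include <assert.h>
#include <stddef.h>

/*
 * p points just after "\0". Reads up to two further octal digits,
 * stores the resulting character code in *value, and returns the
 * number of digits consumed.
 */
static int parse_zero_escape(const char *p, int *value)
{
    int i, v = 0;

    assert(p != NULL && value != NULL);
    for (i = 0; i < 2 && p[i] >= '0' && p[i] <= '7'; i++)
        v = v * 8 + (p[i] - '0');
    *value = v;
    return i;
}
```

With the remaining text "001" the function consumes "00" and leaves "1" as a literal, whereas with "1" alone it would consume the digit and produce code 1 instead of a binary zero followed by "1".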
3223 | The handling of a backslash followed by a digit other than 0 is compli- | The handling of a backslash followed by a digit other than 0 is compli- |
3224 | cated. Outside a character class, PCRE reads it and any following dig- | cated. Outside a character class, PCRE reads it and any following dig- |
3225 | its as a decimal number. If the number is less than 10, or if there | its as a decimal number. If the number is less than 10, or if there |
3226 | have been at least that many previous capturing left parentheses in the | have been at least that many previous capturing left parentheses in the |
3227 | expression, the entire sequence is taken as a back reference. A | expression, the entire sequence is taken as a back reference. A |
3228 | description of how this works is given later, following the discussion | description of how this works is given later, following the discussion |
3229 | of parenthesized subpatterns. | of parenthesized subpatterns. |
3230 | ||
3231 | Inside a character class, or if the decimal number is greater than 9 | Inside a character class, or if the decimal number is greater than 9 |
3232 | and there have not been that many capturing subpatterns, PCRE re-reads | and there have not been that many capturing subpatterns, PCRE re-reads |
3233 | up to three octal digits following the backslash, and uses them to gen- | up to three octal digits following the backslash, and uses them to gen- |
3234 | erate a data character. Any subsequent digits stand for themselves. In | erate a data character. Any subsequent digits stand for themselves. In |
3235 | non-UTF-8 mode, the value of a character specified in octal must be | non-UTF-8 mode, the value of a character specified in octal must be |
3236 | less than \400. In UTF-8 mode, values up to \777 are permitted. For | less than \400. In UTF-8 mode, values up to \777 are permitted. For |
3237 | example: | example: |
3238 | ||
3239 | \040 is another way of writing a space | \040 is another way of writing a space |
# | Line 3173 BACKSLASH | Line 3251 BACKSLASH |
3251 | \81 is either a back reference, or a binary zero | \81 is either a back reference, or a binary zero |
3252 | followed by the two characters "8" and "1" | followed by the two characters "8" and "1" |
3253 | ||
3254 | Note that octal values of 100 or greater must not be introduced by a | Note that octal values of 100 or greater must not be introduced by a |
3255 | leading zero, because no more than three octal digits are ever read. | leading zero, because no more than three octal digits are ever read. |
3256 | ||
3257 | All the sequences that define a single character value can be used both | All the sequences that define a single character value can be used both |
3258 | inside and outside character classes. In addition, inside a character | inside and outside character classes. In addition, inside a character |
3259 | class, the sequence \b is interpreted as the backspace character (hex | class, the sequence \b is interpreted as the backspace character (hex |
3260 | 08), and the sequences \R and \X are interpreted as the characters "R" | 08), and the sequences \R and \X are interpreted as the characters "R" |
3261 | and "X", respectively. Outside a character class, these sequences have | and "X", respectively. Outside a character class, these sequences have |
3262 | different meanings (see below). | different meanings (see below). |
3263 | ||
3264 | Absolute and relative back references | Absolute and relative back references |
3265 | ||
3266 | The sequence \g followed by an unsigned or a negative number, option- | The sequence \g followed by an unsigned or a negative number, option- |
3267 | ally enclosed in braces, is an absolute or relative back reference. A | ally enclosed in braces, is an absolute or relative back reference. A |
3268 | named back reference can be coded as \g{name}. Back references are dis- | named back reference can be coded as \g{name}. Back references are dis- |
3269 | cussed later, following the discussion of parenthesized subpatterns. | cussed later, following the discussion of parenthesized subpatterns. |
3270 | ||
3271 | Absolute and relative subroutine calls | Absolute and relative subroutine calls |
3272 | ||
3273 | For compatibility with Oniguruma, the non-Perl syntax \g followed by a | For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
3274 | name or a number enclosed either in angle brackets or single quotes, is | name or a number enclosed either in angle brackets or single quotes, is |
3275 | an alternative syntax for referencing a subpattern as a "subroutine". | an alternative syntax for referencing a subpattern as a "subroutine". |
3276 | Details are discussed later. Note that \g{...} (Perl syntax) and | Details are discussed later. Note that \g{...} (Perl syntax) and |
3277 | \g<...> (Oniguruma syntax) are not synonymous. The former is a back | \g<...> (Oniguruma syntax) are not synonymous. The former is a back |
3278 | reference; the latter is a subroutine call. | reference; the latter is a subroutine call. |
3279 | ||
3280 | Generic character types | Generic character types |
# | Line 3216 BACKSLASH | Line 3294 BACKSLASH |
3294 | \W any "non-word" character | \W any "non-word" character |
3295 | ||
3296 | Each pair of escape sequences partitions the complete set of characters | Each pair of escape sequences partitions the complete set of characters |
3297 | into two disjoint sets. Any given character matches one, and only one, | into two disjoint sets. Any given character matches one, and only one, |
3298 | of each pair. | of each pair. |
3299 | ||
3300 | These character type sequences can appear both inside and outside char- | These character type sequences can appear both inside and outside char- |
3301 | acter classes. They each match one character of the appropriate type. | acter classes. They each match one character of the appropriate type. |
3302 | If the current matching point is at the end of the subject string, all | If the current matching point is at the end of the subject string, all |
3303 | of them fail, since there is no character to match. | of them fail, since there is no character to match. |
3304 | ||
3305 | For compatibility with Perl, \s does not match the VT character (code | For compatibility with Perl, \s does not match the VT character (code |
3306 | 11). This makes it different from the POSIX "space" class. The \s | 11). This makes it different from the POSIX "space" class. The \s |
3307 | characters are HT (9), LF (10), FF (12), CR (13), and space (32). If | characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
3308 | "use locale;" is included in a Perl script, \s may match the VT charac- | "use locale;" is included in a Perl script, \s may match the VT charac- |
3309 | ter. In PCRE, it never does. | ter. In PCRE, it never does. |
3310 | ||
3311 | In UTF-8 mode, characters with values greater than 128 never match \d, | In UTF-8 mode, characters with values greater than 128 never match \d, |
3312 | \s, or \w, and always match \D, \S, and \W. This is true even when Uni- | \s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
3313 | code character property support is available. These sequences retain | code character property support is available. These sequences retain |
3314 | their original meanings from before UTF-8 support was available, mainly | their original meanings from before UTF-8 support was available, mainly |
3315 | for efficiency reasons. | for efficiency reasons. Note that this also affects \b, because it is |
3316 | defined in terms of \w and \W. | |
3317 | ||
3318 | The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to | The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
3319 | the other sequences, these do match certain high-valued codepoints in | the other sequences, these do match certain high-valued codepoints in |
# | Line 3428 BACKSLASH | Line 3507 BACKSLASH |
3507 | U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see | U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
3508 | RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- | RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
3509 | ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in | ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
3510 | the pcreapi page). | the pcreapi page). Perl does not support the Cs property. |
3511 | ||
3512 | The long synonyms for these properties that Perl supports (such as | The long synonyms for property names that Perl supports (such as |
3513 | \p{Letter}) are not supported by PCRE, nor is it permitted to prefix | \p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
3514 | any of these properties with "Is". | any of these properties with "Is". |
3515 | ||
# | Line 3760 POSIX CHARACTER CLASSES | Line 3839 POSIX CHARACTER CLASSES |
3839 | ||
3840 | VERTICAL BAR | VERTICAL BAR |
3841 | ||
3842 | Vertical bar characters are used to separate alternative patterns. For | Vertical bar characters are used to separate alternative patterns. For |
3843 | example, the pattern | example, the pattern |
3844 | ||
3845 | gilbert|sullivan | gilbert|sullivan |
3846 | ||
3847 | matches either "gilbert" or "sullivan". Any number of alternatives may | matches either "gilbert" or "sullivan". Any number of alternatives may |
3848 | appear, and an empty alternative is permitted (matching the empty | appear, and an empty alternative is permitted (matching the empty |
3849 | string). The matching process tries each alternative in turn, from left | string). The matching process tries each alternative in turn, from left |
3850 | to right, and the first one that succeeds is used. If the alternatives | to right, and the first one that succeeds is used. If the alternatives |
3851 | are within a subpattern (defined below), "succeeds" means matching the | are within a subpattern (defined below), "succeeds" means matching the |
3852 | rest of the main pattern as well as the alternative in the subpattern. | rest of the main pattern as well as the alternative in the subpattern. |
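The left-to-right rule for alternatives can be illustrated with a short sketch. It uses Python's standard `re` module as a stand-in, not PCRE itself; the two engines are assumed to agree on these simple cases, and the variable names are illustrative only.

```python
import re

# Alternatives are tried from left to right; the first one that lets
# the whole pattern succeed is used.
first = re.match(r"a|ab", "ab").group()            # leftmost alternative wins

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern: "a" is tried first, but "c" cannot follow it, so "ab" is used.
inside = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty = re.fullmatch(r"gilbert|", "") is not None

print(first, inside, empty)
```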
3853 | ||
3854 | ||
3855 | INTERNAL OPTION SETTING | INTERNAL OPTION SETTING |
# | Line 3796 INTERNAL OPTION SETTING | Line 3875 INTERNAL OPTION SETTING |
3875 | can be changed in the same way as the Perl-compatible options by using | can be changed in the same way as the Perl-compatible options by using |
3876 | the characters J, U and X respectively. | the characters J, U and X respectively. |
3877 | ||
3878 | When an option change occurs at top level (that is, not inside subpat- | When one of these option changes occurs at top level (that is, not |
3879 | tern parentheses), the change applies to the remainder of the pattern | inside subpattern parentheses), the change applies to the remainder of |
3880 | that follows. If the change is placed right at the start of a pattern, | the pattern that follows. If the change is placed right at the start of |
3881 | PCRE extracts it into the global options (and it will therefore show up | a pattern, PCRE extracts it into the global options (and it will there- |
3882 | in data extracted by the pcre_fullinfo() function). | fore show up in data extracted by the pcre_fullinfo() function). |
3883 | ||
3884 | An option change within a subpattern (see below for a description of | An option change within a subpattern (see below for a description of |
3885 | subpatterns) affects only that part of the current pattern that follows | subpatterns) affects only that part of the current pattern that follows |
# | Line 3823 INTERNAL OPTION SETTING | Line 3902 INTERNAL OPTION SETTING |
3902 | ||
3903 | Note: There are other PCRE-specific options that can be set by the | Note: There are other PCRE-specific options that can be set by the |
3904 | application when the compile or match functions are called. In some | application when the compile or match functions are called. In some |
3905 | cases the pattern can contain special leading sequences to override | cases the pattern can contain special leading sequences such as (*CRLF) |
3906 | what the application has set or what has been defaulted. Details are | to override what the application has set or what has been defaulted. |
3907 | given in the section entitled "Newline sequences" above. | Details are given in the section entitled "Newline sequences" above. |
3908 | There is also the (*UTF8) leading sequence that can be used to set | |
3909 | UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option. | |
3910 | ||
3911 | ||
3912 | SUBPATTERNS | SUBPATTERNS |
# | Line 3964 NAMED SUBPATTERNS | Line 4045 NAMED SUBPATTERNS |
4045 | lowest number is used. For further details of the interfaces for han- | lowest number is used. For further details of the interfaces for han- |
4046 | dling named subpatterns, see the pcreapi documentation. | dling named subpatterns, see the pcreapi documentation. |
4047 | ||
4048 | Warning: You cannot use different names to distinguish between two sub- | |
4049 | patterns with the same number (see the previous section) because PCRE | |
4050 | uses only the numbers when matching. | |
4051 | ||
4052 | ||
4053 | REPETITION | REPETITION |
4054 | ||
# | Line 4004 REPETITION | Line 4089 REPETITION |
4089 | the syntax of a quantifier, is taken as a literal character. For exam- | the syntax of a quantifier, is taken as a literal character. For exam- |
4090 | ple, {,6} is not a quantifier, but a literal string of four characters. | ple, {,6} is not a quantifier, but a literal string of four characters. |
4091 | ||
4092 | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to | In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
4093 | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- | individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- |
4094 | acters, each of which is represented by a two-byte sequence. Similarly, | acters, each of which is represented by a two-byte sequence. Similarly, |
4095 | when Unicode property support is available, \X{3} matches three Unicode | when Unicode property support is available, \X{3} matches three Unicode |
4096 | extended sequences, each of which may be several bytes long (and they | extended sequences, each of which may be several bytes long (and they |
4097 | may be of different lengths). | may be of different lengths). |
4098 | ||
4099 | The quantifier {0} is permitted, causing the expression to behave as if | The quantifier {0} is permitted, causing the expression to behave as if |
4100 | the previous item and the quantifier were not present. This may be use- | the previous item and the quantifier were not present. This may be use- |
4101 | ful for subpatterns that are referenced as subroutines from elsewhere | ful for subpatterns that are referenced as subroutines from elsewhere |
4102 | in the pattern. Items other than subpatterns that have a {0} quantifier | in the pattern. Items other than subpatterns that have a {0} quantifier |
4103 | are omitted from the compiled pattern. | are omitted from the compiled pattern. |
4104 | ||
4105 | For convenience, the three most common quantifiers have single-charac- | For convenience, the three most common quantifiers have single-charac- |
4106 | ter abbreviations: | ter abbreviations: |
4107 | ||
4108 | * is equivalent to {0,} | * is equivalent to {0,} |
4109 | + is equivalent to {1,} | + is equivalent to {1,} |
4110 | ? is equivalent to {0,1} | ? is equivalent to {0,1} |
4111 | ||
4112 | It is possible to construct infinite loops by following a subpattern | It is possible to construct infinite loops by following a subpattern |
4113 | that can match no characters with a quantifier that has no upper limit, | that can match no characters with a quantifier that has no upper limit, |
4114 | for example: | for example: |
4115 | ||
4116 | (a?)* | (a?)* |
4117 | ||
4118 | Earlier versions of Perl and PCRE used to give an error at compile time | Earlier versions of Perl and PCRE used to give an error at compile time |
4119 | for such patterns. However, because there are cases where this can be | for such patterns. However, because there are cases where this can be |
4120 | useful, such patterns are now accepted, but if any repetition of the | useful, such patterns are now accepted, but if any repetition of the |
4121 | subpattern does in fact match no characters, the loop is forcibly bro- | subpattern does in fact match no characters, the loop is forcibly bro- |
4122 | ken. | ken. |
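As a rough illustration of the loop being broken, the sketch below runs the same pattern through Python's `re` module. This is a stand-in engine, not PCRE, and only the overall matched text is inspected, because the captured value of the inner group can differ between engines and versions.

```python
import re

# (a?)* repeats a subpattern that can match no characters at all.
# The engine accepts the pattern and stops repeating once an
# iteration consumes nothing, so matching terminates.
matched = re.match(r"(a?)*", "aab").group()
nothing = re.match(r"(a?)*", "b").group()

print(repr(matched), repr(nothing))
```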
4123 | ||
4124 | By default, the quantifiers are "greedy", that is, they match as much | By default, the quantifiers are "greedy", that is, they match as much |
4125 | as possible (up to the maximum number of permitted times), without | as possible (up to the maximum number of permitted times), without |
4126 | causing the rest of the pattern to fail. The classic example of where | causing the rest of the pattern to fail. The classic example of where |
4127 | this gives problems is in trying to match comments in C programs. These | this gives problems is in trying to match comments in C programs. These |
4128 | appear between /* and */ and within the comment, individual * and / | appear between /* and */ and within the comment, individual * and / |
4129 | characters may appear. An attempt to match C comments by applying the | characters may appear. An attempt to match C comments by applying the |
4130 | pattern | pattern |
4131 | ||
4132 | /\*.*\*/ | /\*.*\*/ |
# | Line 4050 REPETITION | Line 4135 REPETITION |
4135 | ||
4136 | /* first comment */ not comment /* second comment */ | /* first comment */ not comment /* second comment */ |
4137 | ||
4138 | fails, because it matches the entire string owing to the greediness of | fails, because it matches the entire string owing to the greediness of |
4139 | the .* item. | the .* item. |
4140 | ||
4141 | However, if a quantifier is followed by a question mark, it ceases to | However, if a quantifier is followed by a question mark, it ceases to |
4142 | be greedy, and instead matches the minimum number of times possible, so | be greedy, and instead matches the minimum number of times possible, so |
4143 | the pattern | the pattern |
4144 | ||
4145 | /\*.*?\*/ | /\*.*?\*/ |
4146 | ||
4147 | does the right thing with the C comments. The meaning of the various | does the right thing with the C comments. The meaning of the various |
4148 | quantifiers is not otherwise changed, just the preferred number of | quantifiers is not otherwise changed, just the preferred number of |
4149 | matches. Do not confuse this use of question mark with its use as a | matches. Do not confuse this use of question mark with its use as a |
4150 | quantifier in its own right. Because it has two uses, it can sometimes | quantifier in its own right. Because it has two uses, it can sometimes |
4151 | appear doubled, as in | appear doubled, as in |
4152 | ||
4153 | \d??\d | \d??\d |
# | Line 4070 REPETITION | Line 4155 REPETITION |
4155 | which matches one digit by preference, but can match two if that is the | which matches one digit by preference, but can match two if that is the |
4156 | only way the rest of the pattern matches. | only way the rest of the pattern matches. |
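The difference between greedy and minimizing quantifiers described above can be checked with a small sketch. Python's `re` module is used here as a stand-in for PCRE (an assumption; both engines treat `*`, `*?` and `??` the same way in these cases), and the variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole string is matched.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first "*/", so each comment is found separately.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one = re.match(r"\d??\d", "12").group()
two = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, one, two)
```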
4157 | ||
4158 | If the PCRE_UNGREEDY option is set (an option that is not available in | If the PCRE_UNGREEDY option is set (an option that is not available in |
4159 | Perl), the quantifiers are not greedy by default, but individual ones | Perl), the quantifiers are not greedy by default, but individual ones |
4160 | can be made greedy by following them with a question mark. In other | can be made greedy by following them with a question mark. In other |
4161 | words, it inverts the default behaviour. | words, it inverts the default behaviour. |
4162 | ||
4163 | When a parenthesized subpattern is quantified with a minimum repeat | When a parenthesized subpattern is quantified with a minimum repeat |
4164 | count that is greater than 1 or with a limited maximum, more memory is | count that is greater than 1 or with a limited maximum, more memory is |
4165 | required for the compiled pattern, in proportion to the size of the | required for the compiled pattern, in proportion to the size of the |
4166 | minimum or maximum. | minimum or maximum. |
4167 | ||
4168 | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- | If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- |
4169 | alent to Perl's /s) is set, thus allowing the dot to match newlines, | alent to Perl's /s) is set, thus allowing the dot to match newlines, |
4170 | the pattern is implicitly anchored, because whatever follows will be | the pattern is implicitly anchored, because whatever follows will be |
4171 | tried against every character position in the subject string, so there | tried against every character position in the subject string, so there |
4172 | is no point in retrying the overall match at any position after the | is no point in retrying the overall match at any position after the |
4173 | first. PCRE normally treats such a pattern as though it were preceded | first. PCRE normally treats such a pattern as though it were preceded |
4174 | by \A. | by \A. |
4175 | ||
4176 | In cases where it is known that the subject string contains no new- | In cases where it is known that the subject string contains no new- |
4177 | lines, it is worth setting PCRE_DOTALL in order to obtain this opti- | lines, it is worth setting PCRE_DOTALL in order to obtain this opti- |
4178 | mization, or alternatively using ^ to indicate anchoring explicitly. | mization, or alternatively using ^ to indicate anchoring explicitly. |
4179 | ||
4180 | However, there is one situation where the optimization cannot be used. | However, there is one situation where the optimization cannot be used. |
4181 | When .* is inside capturing parentheses that are the subject of a | When .* is inside capturing parentheses that are the subject of a |
4182 | backreference elsewhere in the pattern, a match at the start may fail | backreference elsewhere in the pattern, a match at the start may fail |
4183 | where a later one succeeds. Consider, for example: | where a later one succeeds. Consider, for example: |
4184 | ||
4185 | (.*)abc\1 | (.*)abc\1 |
4186 | ||
4187 | If the subject is "xyz123abc123" the match point is the fourth charac- | If the subject is "xyz123abc123" the match point is the fourth charac- |
4188 | ter. For this reason, such a pattern is not implicitly anchored. | ter. For this reason, such a pattern is not implicitly anchored. |
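The match point in this example can be confirmed with a short sketch. It uses Python's `re` module rather than PCRE, on the assumption that back references behave the same way for this pattern; offsets are zero-based, so the fourth character is offset 3.

```python
import re

# An attempt at offset 0 fails: (.*) would have to capture "xyz123",
# but only "123" follows "abc". The match succeeds at a later start point.
m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()        # zero-based offset of the match point
captured = m.group(1)    # text captured by (.*)
whole = m.group()

print(start, captured, whole)
```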
4189 | ||
4190 | When a capturing subpattern is repeated, the value captured is the sub- | When a capturing subpattern is repeated, the value captured is the sub- |
# | Line 4108 REPETITION | Line 4193 REPETITION |
4193 | (tweedle[dume]{3}\s*)+ | (tweedle[dume]{3}\s*)+ |
4194 | ||
4195 | has matched "tweedledum tweedledee" the value of the captured substring | has matched "tweedledum tweedledee" the value of the captured substring |
4196 | is "tweedledee". However, if there are nested capturing subpatterns, | is "tweedledee". However, if there are nested capturing subpatterns, |
4197 | the corresponding captured values may have been set in previous itera- | the corresponding captured values may have been set in previous itera- |
4198 | tions. For example, after | tions. For example, after |
4199 | ||
4200 | /(a|(b))+/ | /(a|(b))+/ |
# | Line 4119 REPETITION | Line 4204 REPETITION |
4204 | ||
4205 | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
4206 | ||
4207 | With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") | With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
4208 | repetition, failure of what follows normally causes the repeated item | repetition, failure of what follows normally causes the repeated item |
4209 | to be re-evaluated to see if a different number of repeats allows the | to be re-evaluated to see if a different number of repeats allows the |
4210 | rest of the pattern to match. Sometimes it is useful to prevent this, | rest of the pattern to match. Sometimes it is useful to prevent this, |
4211 | either to change the nature of the match, or to cause it to fail earlier | either to change the nature of the match, or to cause it to fail earlier |
4212 | than it otherwise might, when the author of the pattern knows there is | than it otherwise might, when the author of the pattern knows there is |
4213 | no point in carrying on. | no point in carrying on. |
4214 | ||
4215 | Consider, for example, the pattern \d+foo when applied to the subject | Consider, for example, the pattern \d+foo when applied to the subject |
4216 | line | line |
4217 | ||
4218 | 123456bar | 123456bar |
4219 | ||
4220 | After matching all 6 digits and then failing to match "foo", the normal | After matching all 6 digits and then failing to match "foo", the normal |
4221 | action of the matcher is to try again with only 5 digits matching the | action of the matcher is to try again with only 5 digits matching the |
4222 | \d+ item, and then with 4, and so on, before ultimately failing. | \d+ item, and then with 4, and so on, before ultimately failing. |
4223 | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides | "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides |
4224 | the means for specifying that once a subpattern has matched, it is not | the means for specifying that once a subpattern has matched, it is not |
4225 | to be re-evaluated in this way. | to be re-evaluated in this way. |
4226 | ||
4227 | If we use atomic grouping for the previous example, the matcher gives | If we use atomic grouping for the previous example, the matcher gives |
4228 | up immediately on failing to match "foo" the first time. The notation | up immediately on failing to match "foo" the first time. The notation |
4229 | is a kind of special parenthesis, starting with (?> as in this example: | is a kind of special parenthesis, starting with (?> as in this example: |
4230 | ||
4231 | (?>\d+)foo | (?>\d+)foo |
# | Line 4218 ATOMIC GROUPING AND POSSESSIVE QUANTIFIE | Line 4303 ATOMIC GROUPING AND POSSESSIVE QUANTIFIE |
4303 | ||
4304 | ((?>\D+)|<\d+>)*[!?] | ((?>\D+)|<\d+>)*[!?] |
4305 | ||
4306 | sequences of non-digits cannot be broken, and failure happens quickly. | sequences of non-digits cannot be broken, and failure happens quickly. |
4307 | ||
4308 | ||
4309 | BACK REFERENCES | BACK REFERENCES |
4310 | ||
4311 | Outside a character class, a backslash followed by a digit greater than | Outside a character class, a backslash followed by a digit greater than |
4312 | 0 (and possibly further digits) is a back reference to a capturing sub- | 0 (and possibly further digits) is a back reference to a capturing sub- |
4313 | pattern earlier (that is, to its left) in the pattern, provided there | pattern earlier (that is, to its left) in the pattern, provided there |
4314 | have been that many previous capturing left parentheses. | have been that many previous capturing left parentheses. |
4315 | ||
4316 | However, if the decimal number following the backslash is less than 10, | However, if the decimal number following the backslash is less than 10, |
4317 | it is always taken as a back reference, and causes an error only if | it is always taken as a back reference, and causes an error only if |
4318 | there are not that many capturing left parentheses in the entire pat- | there are not that many capturing left parentheses in the entire pat- |
4319 | tern. In other words, the parentheses that are referenced need not be | tern. In other words, the parentheses that are referenced need not be |
4320 | to the left of the reference for numbers less than 10. A "forward back | to the left of the reference for numbers less than 10. A "forward back |
4321 | reference" of this type can make sense when a repetition is involved | reference" of this type can make sense when a repetition is involved |
4322 | and the subpattern to the right has participated in an earlier itera- | and the subpattern to the right has participated in an earlier itera- |
4323 | tion. | tion. |
4324 | ||
4325 | It is not possible to have a numerical "forward back reference" to a | It is not possible to have a numerical "forward back reference" to a |
4326 | subpattern whose number is 10 or more using this syntax because a | subpattern whose number is 10 or more using this syntax because a |
4327 | sequence such as \50 is interpreted as a character defined in octal. | sequence such as \50 is interpreted as a character defined in octal. |
4328 | See the subsection entitled "Non-printing characters" above for further | See the subsection entitled "Non-printing characters" above for further |
4329 | details of the handling of digits following a backslash. There is no | details of the handling of digits following a backslash. There is no |
4330 | such problem when named parentheses are used. A back reference to any | such problem when named parentheses are used. A back reference to any |
4331 | subpattern is possible using named parentheses (see below). | subpattern is possible using named parentheses (see below). |
4332 | ||
4333 | Another way of avoiding the ambiguity inherent in the use of digits | Another way of avoiding the ambiguity inherent in the use of digits |
4334 | following a backslash is to use the \g escape sequence, which is a fea- | following a backslash is to use the \g escape sequence, which is a fea- |
4335 | ture introduced in Perl 5.10. This escape must be followed by an | ture introduced in Perl 5.10. This escape must be followed by an |
4336 | unsigned number or a negative number, optionally enclosed in braces. | unsigned number or a negative number, optionally enclosed in braces. |
4337 | These examples are all identical: | These examples are all identical: |
4338 | ||
4339 | (ring), \1 | (ring), \1 |
4340 | (ring), \g1 | (ring), \g1 |
4341 | (ring), \g{1} | (ring), \g{1} |
4342 | ||
4343 | An unsigned number specifies an absolute reference without the ambigu- | An unsigned number specifies an absolute reference without the ambigu- |
4344 | ity that is present in the older syntax. It is also useful when literal | ity that is present in the older syntax. It is also useful when literal |
4345 | digits follow the reference. A negative number is a relative reference. | digits follow the reference. A negative number is a relative reference. |
4346 | Consider this example: | Consider this example: |
# | Line 4263 BACK REFERENCES | Line 4348 BACK REFERENCES |
4348 | (abc(def)ghi)\g{-1} | (abc(def)ghi)\g{-1} |
4349 | ||
4350 | The sequence \g{-1} is a reference to the most recently started captur- | The sequence \g{-1} is a reference to the most recently started captur- |
4351 | ing subpattern before \g, that is, it is equivalent to \2. Similarly, | ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
4352 | \g{-2} would be equivalent to \1. The use of relative references can be | \g{-2} would be equivalent to \1. The use of relative references can be |
4353 | helpful in long patterns, and also in patterns that are created by | helpful in long patterns, and also in patterns that are created by |
4354 | joining together fragments that contain references within themselves. | joining together fragments that contain references within themselves. |
4355 | ||
4356 | A back reference matches whatever actually matched the capturing sub- | A back reference matches whatever actually matched the capturing sub- |
4357 | pattern in the current subject string, rather than anything matching | pattern in the current subject string, rather than anything matching |
4358 | the subpattern itself (see "Subpatterns as subroutines" below for a way | the subpattern itself (see "Subpatterns as subroutines" below for a way |
4359 | of doing that). So the pattern | of doing that). So the pattern |
4360 | ||
4361 | (sens|respons)e and \1ibility | (sens|respons)e and \1ibility |
4362 | ||
4363 | matches "sense and sensibility" and "response and responsibility", but | matches "sense and sensibility" and "response and responsibility", but |
4364 | not "sense and responsibility". If caseful matching is in force at the | not "sense and responsibility". If caseful matching is in force at the |
4365 | time of the back reference, the case of letters is relevant. For exam- | time of the back reference, the case of letters is relevant. For exam- |
4366 | ple, | ple, |
4367 | ||
4368 | ((?i)rah)\s+\1 | ((?i)rah)\s+\1 |
4369 | ||
4370 | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the | matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
4371 | original capturing subpattern is matched caselessly. | original capturing subpattern is matched caselessly. |
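Both back reference examples above can be tried in a short sketch. It uses Python's `re` module as a stand-in for PCRE. One adaptation is an assumption of this sketch: recent Python versions reject an unscoped (?i) in the middle of a pattern, so the scoped form (?i:rah) is used to keep the caseless setting local to the group.

```python
import re

# A back reference matches the text the group actually matched,
# not anything that the subpattern could match.
pat = re.compile(r"(sens|respons)e and \1ibility")
same_1 = pat.fullmatch("sense and sensibility") is not None
same_2 = pat.fullmatch("response and responsibility") is not None
mixed = pat.fullmatch("sense and responsibility") is not None

# The group is matched caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
lower = rah.fullmatch("rah rah") is not None
upper = rah.fullmatch("RAH RAH") is not None
differ = rah.fullmatch("RAH rah") is not None

print(same_1, same_2, mixed, lower, upper, differ)
```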
4372 | ||
4373 | There are several different ways of writing back references to named | There are several different ways of writing back references to named |
4374 | subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or | subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
4375 | \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's | \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
4376 | unified back reference syntax, in which \g can be used for both numeric | unified back reference syntax, in which \g can be used for both numeric |
4377 | and named references, is also supported. We could rewrite the above | and named references, is also supported. We could rewrite the above |
4378 | example in any of the following ways: | example in any of the following ways: |
4379 | ||
4380 | (?<p1>(?i)rah)\s+\k<p1> | (?<p1>(?i)rah)\s+\k<p1> |
# | Line 4297 BACK REFERENCES | Line 4382 BACK REFERENCES |
4382 | (?P<p1>(?i)rah)\s+(?P=p1) | (?P<p1>(?i)rah)\s+(?P=p1) |
4383 | (?<p1>(?i)rah)\s+\g{p1} | (?<p1>(?i)rah)\s+\g{p1} |
4384 | ||
4385 | A subpattern that is referenced by name may appear in the pattern | A subpattern that is referenced by name may appear in the pattern |
4386 | before or after the reference. | before or after the reference. |
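
The case rule for back references can be checked with a short script. This is a minimal sketch that uses Python's re module as a stand-in for PCRE, which is an assumption of the example rather than part of the PCRE API. Python needs the scoped form (?i:rah) in place of a mid-pattern (?i), and the helper name matches() is hypothetical.

```python
import re

# Assumption: Python's re is used only as a stand-in for PCRE here.
# (?i:rah) makes just the captured text caseless; the back reference
# itself is still compared casefully.
numbered = re.compile(r"((?i:rah))\s+\1")
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None


for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, matches(numbered, subject), matches(named, subject))
```

The numbered and the named form should agree on every subject, since they differ only in how the group is referenced.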
4387 | ||
4388 | There may be more than one back reference to the same subpattern. If a | There may be more than one back reference to the same subpattern. If a |
4389 | subpattern has not actually been used in a particular match, any back | subpattern has not actually been used in a particular match, any back |
4390 | references to it always fail. For example, the pattern | references to it always fail. For example, the pattern |
4391 | ||
4392 | (a|(bc))\2 | (a|(bc))\2 |
4393 | ||
4394 | always fails if it starts to match "a" rather than "bc". Because there | always fails if it starts to match "a" rather than "bc". Because there |
4395 | may be many capturing parentheses in a pattern, all digits following | may be many capturing parentheses in a pattern, all digits following |
4396 | the backslash are taken as part of a potential back reference number. | the backslash are taken as part of a potential back reference number. |
4397 | If the pattern continues with a digit character, some delimiter must be | If the pattern continues with a digit character, some delimiter must be |
4398 | used to terminate the back reference. If the PCRE_EXTENDED option is | used to terminate the back reference. If the PCRE_EXTENDED option is |
4399 | set, this can be whitespace. Otherwise an empty comment (see "Com- | set, this can be whitespace. Otherwise an empty comment (see "Com- |
4400 | ments" below) can be used. | ments" below) can be used. |
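
Both points in the paragraph above, the failing reference to an unused subpattern and the empty comment used as a delimiter, can be tried out as follows. This is a sketch in Python's re module, used here as an assumed stand-in for PCRE; the variable names are illustrative only.

```python
import re

# Assumption: Python's re stands in for PCRE; both treat a back reference
# to a group that did not take part in the match as a failure.
unset_ref = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" alternative is taken.
print(unset_ref.fullmatch("bcbc") is not None)
print(unset_ref.search("aa") is None)

# The empty comment (?#) ends the reference \1, so the following "2"
# is a literal digit and not part of the reference number.
delimited = re.compile(r"(a)\1(?#)2")
print(delimited.fullmatch("aa2") is not None)
```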
4401 | ||
4402 | A back reference that occurs inside the parentheses to which it refers | A back reference that occurs inside the parentheses to which it refers |
4403 | fails when the subpattern is first used, so, for example, (a\1) never | fails when the subpattern is first used, so, for example, (a\1) never |
4404 | matches. However, such references can be useful inside repeated sub- | matches. However, such references can be useful inside repeated sub- |
4405 | patterns. For example, the pattern | patterns. For example, the pattern |
4406 | ||
4407 | (a|b\1)+ | (a|b\1)+ |
4408 | ||
4409 | matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- | matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
4410 | ation of the subpattern, the back reference matches the character | ation of the subpattern, the back reference matches the character |
4411 | string corresponding to the previous iteration. In order for this to | string corresponding to the previous iteration. In order for this to |
4412 | work, the pattern must be such that the first iteration does not need | work, the pattern must be such that the first iteration does not need |
4413 | to match the back reference. This can be done using alternation, as in | to match the back reference. This can be done using alternation, as in |
4414 | the example above, or by a quantifier with a minimum of zero. | the example above, or by a quantifier with a minimum of zero. |
4415 | ||
4416 | ||
4417 | ASSERTIONS | ASSERTIONS |
4418 | ||
4419 | An assertion is a test on the characters following or preceding the | An assertion is a test on the characters following or preceding the |
4420 | current matching point that does not actually consume any characters. | current matching point that does not actually consume any characters. |
4421 | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are | The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
4422 | described above. | described above. |
4423 | ||
4424 | More complicated assertions are coded as subpatterns. There are two | More complicated assertions are coded as subpatterns. There are two |
4425 | kinds: those that look ahead of the current position in the subject | kinds: those that look ahead of the current position in the subject |
4426 | string, and those that look behind it. An assertion subpattern is | string, and those that look behind it. An assertion subpattern is |
4427 | matched in the normal way, except that it does not cause the current | matched in the normal way, except that it does not cause the current |
4428 | matching position to be changed. | matching position to be changed. |
4429 | ||
4430 | Assertion subpatterns are not capturing subpatterns, and may not be | Assertion subpatterns are not capturing subpatterns, and may not be |
4431 | repeated, because it makes no sense to assert the same thing several | repeated, because it makes no sense to assert the same thing several |
4432 | times. If any kind of assertion contains capturing subpatterns within | times. If any kind of assertion contains capturing subpatterns within |
4433 | it, these are counted for the purposes of numbering the capturing sub- | it, these are counted for the purposes of numbering the capturing sub- |
4434 | patterns in the whole pattern. However, substring capturing is carried | patterns in the whole pattern. However, substring capturing is carried |
4435 | out only for positive assertions, because it does not make sense for | out only for positive assertions, because it does not make sense for |
4436 | negative assertions. | negative assertions. |
4437 | ||
4438 | Lookahead assertions | Lookahead assertions |
# | Line 4357 ASSERTIONS | Line 4442 ASSERTIONS |
4442 | ||
4443 | \w+(?=;) | \w+(?=;) |
4444 | ||
4445 | matches a word followed by a semicolon, but does not include the semi- | matches a word followed by a semicolon, but does not include the semi- |
4446 | colon in the match, and | colon in the match, and |
4447 | ||
4448 | foo(?!bar) | foo(?!bar) |
4449 | ||
4450 | matches any occurrence of "foo" that is not followed by "bar". Note | matches any occurrence of "foo" that is not followed by "bar". Note |
4451 | that the apparently similar pattern | that the apparently similar pattern |
4452 | ||
4453 | (?!foo)bar | (?!foo)bar |
4454 | ||
4455 | does not find an occurrence of "bar" that is preceded by something | does not find an occurrence of "bar" that is preceded by something |
4456 | other than "foo"; it finds any occurrence of "bar" whatsoever, because | other than "foo"; it finds any occurrence of "bar" whatsoever, because |
4457 | the assertion (?!foo) is always true when the next three characters are | the assertion (?!foo) is always true when the next three characters are |
4458 | "bar". A lookbehind assertion is needed to achieve the other effect. | "bar". A lookbehind assertion is needed to achieve the other effect. |
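
The three lookahead patterns above can be compared directly. The sketch below uses Python's re module as an assumed stand-in for PCRE; the subject strings are made up for illustration.

```python
import re

# Assumption: Python's re stands in for PCRE for these lookahead forms.

# The semicolon is required but is not part of the matched text.
word = re.search(r"\w+(?=;)", "total;")
print(word.group())

# The first "foo" is followed by "bar", so only the second one matches.
not_followed = re.search(r"foo(?!bar)", "foobar foobaz")
print(not_followed.start())

# (?!foo) is tested at the position where "bar" starts, so it always
# succeeds there and "bar" is found even though "foo" precedes it.
misleading = re.search(r"(?!foo)bar", "foobar")
print(misleading.start())
```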
4459 | ||
4460 | If you want to force a matching failure at some point in a pattern, the | If you want to force a matching failure at some point in a pattern, the |
4461 | most convenient way to do it is with (?!) because an empty string | most convenient way to do it is with (?!) because an empty string |
4462 | always matches, so an assertion that requires there not to be an empty | always matches, so an assertion that requires there not to be an empty |
4463 | string must always fail. | string must always fail. |
4464 | ||
4465 | Lookbehind assertions | Lookbehind assertions |
4466 | ||
4467 | Lookbehind assertions start with (?<= for positive assertions and (?<! | Lookbehind assertions start with (?<= for positive assertions and (?<! |
4468 | for negative assertions. For example, | for negative assertions. For example, |
4469 | ||
4470 | (?<!foo)bar | (?<!foo)bar |
4471 | ||
4472 | does find an occurrence of "bar" that is not preceded by "foo". The | does find an occurrence of "bar" that is not preceded by "foo". The |
4473 | contents of a lookbehind assertion are restricted such that all the | contents of a lookbehind assertion are restricted such that all the |
4474 | strings it matches must have a fixed length. However, if there are sev- | strings it matches must have a fixed length. However, if there are sev- |
4475 | eral top-level alternatives, they do not all have to have the same | eral top-level alternatives, they do not all have to have the same |
4476 | fixed length. Thus | fixed length. Thus |
4477 | ||
4478 | (?<=bullock|donkey) | (?<=bullock|donkey) |
# | Line 4396 ASSERTIONS | Line 4481 ASSERTIONS |
4481 | ||
4482 | (?<!dogs?|cats?) | (?<!dogs?|cats?) |
4483 | ||
4484 | causes an error at compile time. Branches that match different length | causes an error at compile time. Branches that match different length |
4485 | strings are permitted only at the top level of a lookbehind assertion. | strings are permitted only at the top level of a lookbehind assertion. |
4486 | This is an extension compared with Perl (at least for 5.8), which | This is an extension compared with Perl (at least for 5.8), which |
4487 | requires all branches to match the same length of string. An assertion | requires all branches to match the same length of string. An assertion |
4488 | such as | such as |
4489 | ||
4490 | (?<=ab(c|de)) | (?<=ab(c|de)) |
4491 | ||
4492 | is not permitted, because its single top-level branch can match two | is not permitted, because its single top-level branch can match two |
4493 | different lengths, but it is acceptable if rewritten to use two top- | different lengths, but it is acceptable if rewritten to use two top- |
4494 | level branches: | level branches: |
4495 | ||
4496 | (?<=abc|abde) | (?<=abc|abde) |
4497 | ||
4498 | In some cases, the Perl 5.10 escape sequence \K (see above) can be used | In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
4499 | instead of a lookbehind assertion; this is not restricted to a fixed- | instead of a lookbehind assertion; this is not restricted to a fixed- |
4500 | length. | length. |
4501 | ||
4502 | The implementation of lookbehind assertions is, for each alternative, | The implementation of lookbehind assertions is, for each alternative, |
4503 | to temporarily move the current position back by the fixed length and | to temporarily move the current position back by the fixed length and |
4504 | then try to match. If there are insufficient characters before the cur- | then try to match. If there are insufficient characters before the cur- |
4505 | rent position, the assertion fails. | rent position, the assertion fails. |
4506 | ||
4507 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 | PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
4508 | mode) to appear in lookbehind assertions, because it makes it impossi- | mode) to appear in lookbehind assertions, because it makes it impossi- |
4509 | ble to calculate the length of the lookbehind. The \X and \R escapes, | ble to calculate the length of the lookbehind. The \X and \R escapes, |
4510 | which can match different numbers of bytes, are also not permitted. | which can match different numbers of bytes, are also not permitted. |
4511 | ||
4512 | Possessive quantifiers can be used in conjunction with lookbehind | Possessive quantifiers can be used in conjunction with lookbehind |
4513 | assertions to specify efficient matching at the end of the subject | assertions to specify efficient matching at the end of the subject |
4514 | string. Consider a simple pattern such as | string. Consider a simple pattern such as |
4515 | ||
4516 | abcd$ | abcd$ |
4517 | ||
4518 | when applied to a long string that does not match. Because matching | when applied to a long string that does not match. Because matching |
4519 | proceeds from left to right, PCRE will look for each "a" in the subject | proceeds from left to right, PCRE will look for each "a" in the subject |
4520 | and then see if what follows matches the rest of the pattern. If the | and then see if what follows matches the rest of the pattern. If the |
4521 | pattern is specified as | pattern is specified as |
4522 | ||
4523 | ^.*abcd$ | ^.*abcd$ |
4524 | ||
4525 | the initial .* matches the entire string at first, but when this fails | the initial .* matches the entire string at first, but when this fails |
4526 | (because there is no following "a"), it backtracks to match all but the | (because there is no following "a"), it backtracks to match all but the |
4527 | last character, then all but the last two characters, and so on. Once | last character, then all but the last two characters, and so on. Once |
4528 | again the search for "a" covers the entire string, from right to left, | again the search for "a" covers the entire string, from right to left, |
4529 | so we are no better off. However, if the pattern is written as | so we are no better off. However, if the pattern is written as |
4530 | ||
4531 | ^.*+(?<=abcd) | ^.*+(?<=abcd) |
4532 | ||
4533 | there can be no backtracking for the .*+ item; it can match only the | there can be no backtracking for the .*+ item; it can match only the |
4534 | entire string. The subsequent lookbehind assertion does a single test | entire string. The subsequent lookbehind assertion does a single test |
4535 | on the last four characters. If it fails, the match fails immediately. | on the last four characters. If it fails, the match fails immediately. |
4536 | For long strings, this approach makes a significant difference to the | For long strings, this approach makes a significant difference to the |
4537 | processing time. | processing time. |
4538 | ||
4539 | Using multiple assertions | Using multiple assertions |
# | Line 4457 ASSERTIONS | Line 4542 ASSERTIONS |
4542 | ||
4543 | (?<=\d{3})(?<!999)foo | (?<=\d{3})(?<!999)foo |
4544 | ||
4545 | matches "foo" preceded by three digits that are not "999". Notice that | matches "foo" preceded by three digits that are not "999". Notice that |
4546 | each of the assertions is applied independently at the same point in | each of the assertions is applied independently at the same point in |
4547 | the subject string. First there is a check that the previous three | the subject string. First there is a check that the previous three |
4548 | characters are all digits, and then there is a check that the same | characters are all digits, and then there is a check that the same |
4549 | three characters are not "999". This pattern does not match "foo" pre- | three characters are not "999". This pattern does not match "foo" pre- |
4550 | ceded by six characters, the first three of which are digits and the last | ceded by six characters, the first three of which are digits and the last |
4551 | three of which are not "999". For example, it doesn't match "123abc- | three of which are not "999". For example, it doesn't match "123abc- |
4552 | foo". A pattern to do that is | foo". A pattern to do that is |
4553 | ||
4554 | (?<=\d{3}...)(?<!999)foo | (?<=\d{3}...)(?<!999)foo |
4555 | ||
4556 | This time the first assertion looks at the preceding six characters, | This time the first assertion looks at the preceding six characters, |
4557 | checking that the first three are digits, and then the second assertion | checking that the first three are digits, and then the second assertion |
4558 | checks that the preceding three characters are not "999". | checks that the preceding three characters are not "999". |
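
The difference between the two patterns shows up on a few subjects. This is a sketch using Python's re module as an assumed stand-in for PCRE; the names same_three and six_back are hypothetical.

```python
import re

# Assumption: Python's re stands in for PCRE; both lookbehinds here have
# a fixed length, which both engines require.

# Both assertions look at the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          same_three.search(subject) is not None,
          six_back.search(subject) is not None)
```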
4559 | ||
# | Line 4476 ASSERTIONS | Line 4561 ASSERTIONS |
4561 | ||
4562 | (?<=(?<!foo)bar)baz | (?<=(?<!foo)bar)baz |
4563 | ||
4564 | matches an occurrence of "baz" that is preceded by "bar" which in turn | matches an occurrence of "baz" that is preceded by "bar" which in turn |
4565 | is not preceded by "foo", while | is not preceded by "foo", while |
4566 | ||
4567 | (?<=\d{3}(?!999)...)foo | (?<=\d{3}(?!999)...)foo |
4568 | ||
4569 | is another pattern that matches "foo" preceded by three digits and any | is another pattern that matches "foo" preceded by three digits and any |
4570 | three characters that are not "999". | three characters that are not "999". |
4571 | ||
4572 | ||
4573 | CONDITIONAL SUBPATTERNS | CONDITIONAL SUBPATTERNS |
4574 | ||
4575 | It is possible to cause the matching process to obey a subpattern con- | It is possible to cause the matching process to obey a subpattern con- |
4576 | ditionally or to choose between two alternative subpatterns, depending | ditionally or to choose between two alternative subpatterns, depending |
4577 | on the result of an assertion, or whether a previous capturing subpat- | on the result of an assertion, or whether a previous capturing subpat- |
4578 | tern matched or not. The two possible forms of conditional subpattern | tern matched or not. The two possible forms of conditional subpattern |
4579 | are | are |
4580 | ||
4581 | (?(condition)yes-pattern) | (?(condition)yes-pattern) |
4582 | (?(condition)yes-pattern|no-pattern) | (?(condition)yes-pattern|no-pattern) |
4583 | ||
4584 | If the condition is satisfied, the yes-pattern is used; otherwise the | If the condition is satisfied, the yes-pattern is used; otherwise the |
4585 | no-pattern (if present) is used. If there are more than two alterna- | no-pattern (if present) is used. If there are more than two alterna- |
4586 | tives in the subpattern, a compile-time error occurs. | tives in the subpattern, a compile-time error occurs. |
4587 | ||
4588 | There are four kinds of condition: references to subpatterns, refer- | There are four kinds of condition: references to subpatterns, refer- |
4589 | ences to recursion, a pseudo-condition called DEFINE, and assertions. | ences to recursion, a pseudo-condition called DEFINE, and assertions. |
4590 | ||
4591 | Checking for a used subpattern by number | Checking for a used subpattern by number |
4592 | ||
4593 | If the text between the parentheses consists of a sequence of digits, | If the text between the parentheses consists of a sequence of digits, |
4594 | the condition is true if the capturing subpattern of that number has | the condition is true if the capturing subpattern of that number has |
4595 | previously matched. An alternative notation is to precede the digits | previously matched. An alternative notation is to precede the digits |
4596 | with a plus or minus sign. In this case, the subpattern number is rela- | with a plus or minus sign. In this case, the subpattern number is rela- |
4597 | tive rather than absolute. The most recently opened parentheses can be | tive rather than absolute. The most recently opened parentheses can be |
4598 | referenced by (?(-1), the next most recent by (?(-2), and so on. In | referenced by (?(-1), the next most recent by (?(-2), and so on. In |
4599 | looping constructs it can also make sense to refer to subsequent groups | looping constructs it can also make sense to refer to subsequent groups |
4600 | with constructs such as (?(+2). | with constructs such as (?(+2). |
4601 | ||
4602 | Consider the following pattern, which contains non-significant white | Consider the following pattern, which contains non-significant white |
4603 | space to make it more readable (assume the PCRE_EXTENDED option) and to | space to make it more readable (assume the PCRE_EXTENDED option) and to |
4604 | divide it into three parts for ease of discussion: | divide it into three parts for ease of discussion: |
4605 | ||
4606 | ( \( )? [^()]+ (?(1) \) ) | ( \( )? [^()]+ (?(1) \) ) |
4607 | ||
4608 | The first part matches an optional opening parenthesis, and if that | The first part matches an optional opening parenthesis, and if that |
4609 | character is present, sets it as the first captured substring. The sec- | character is present, sets it as the first captured substring. The sec- |
4610 | ond part matches one or more characters that are not parentheses. The | ond part matches one or more characters that are not parentheses. The |
4611 | third part is a conditional subpattern that tests whether the first set | third part is a conditional subpattern that tests whether the first set |
4612 | of parentheses matched or not. If they did, that is, if the subject started | of parentheses matched or not. If they did, that is, if the subject started |
4613 | with an opening parenthesis, the condition is true, and so the yes-pat- | with an opening parenthesis, the condition is true, and so the yes-pat- |
4614 | tern is executed and a closing parenthesis is required. Otherwise, | tern is executed and a closing parenthesis is required. Otherwise, |
4615 | since no-pattern is not present, the subpattern matches nothing. In | since no-pattern is not present, the subpattern matches nothing. In |
4616 | other words, this pattern matches a sequence of non-parentheses, | other words, this pattern matches a sequence of non-parentheses, |
4617 | optionally enclosed in parentheses. | optionally enclosed in parentheses. |
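
The behaviour of this conditional can be confirmed with a small script. The sketch below uses Python's re module as an assumed stand-in for PCRE, with re.VERBOSE playing the role of PCRE_EXTENDED; Python accepts the absolute form (?(1)...) but not the relative form (?(-1)...).

```python
import re

# Assumption: Python's re stands in for PCRE, and re.VERBOSE stands in
# for the PCRE_EXTENDED option so that white space is ignored.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ("abc", "(abc)", "(abc", "abc)"):
    # A closing parenthesis is required only if an opening one was captured.
    print(subject, optional_parens.fullmatch(subject) is not None)
```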
4618 | ||
4619 | If you were embedding this pattern in a larger one, you could use a | If you were embedding this pattern in a larger one, you could use a |
4620 | relative reference: | relative reference: |
4621 | ||
4622 | ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... | ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
4623 | ||
4624 | This makes the fragment independent of the parentheses in the larger | This makes the fragment independent of the parentheses in the larger |
4625 | pattern. | pattern. |
4626 | ||
4627 | Checking for a used subpattern by name | Checking for a used subpattern by name |
4628 | ||
4629 | Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a | Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
4630 | used subpattern by name. For compatibility with earlier versions of | used subpattern by name. For compatibility with earlier versions of |
4631 | PCRE, which had this facility before Perl, the syntax (?(name)...) is | PCRE, which had this facility before Perl, the syntax (?(name)...) is |
4632 | also recognized. However, there is a possible ambiguity with this syn- | also recognized. However, there is a possible ambiguity with this syn- |
4633 | tax, because subpattern names may consist entirely of digits. PCRE | tax, because subpattern names may consist entirely of digits. PCRE |
4634 | looks first for a named subpattern; if it cannot find one and the name | looks first for a named subpattern; if it cannot find one and the name |
4635 | consists entirely of digits, PCRE looks for a subpattern of that num- | consists entirely of digits, PCRE looks for a subpattern of that num- |
4636 | ber, which must be greater than zero. Using subpattern names that con- | ber, which must be greater than zero. Using subpattern names that con- |
4637 | sist entirely of digits is not recommended. | sist entirely of digits is not recommended. |
4638 | ||
4639 | Rewriting the above example to use a named subpattern gives this: | Rewriting the above example to use a named subpattern gives this: |
# | Line 4559 CONDITIONAL SUBPATTERNS | Line 4644 CONDITIONAL SUBPATTERNS |
4644 | Checking for pattern recursion | Checking for pattern recursion |
4645 | ||
4646 | If the condition is the string (R), and there is no subpattern with the | If the condition is the string (R), and there is no subpattern with the |
4647 | name R, the condition is true if a recursive call to the whole pattern | name R, the condition is true if a recursive call to the whole pattern |
4648 | or any subpattern has been made. If digits or a name preceded by amper- | or any subpattern has been made. If digits or a name preceded by amper- |
4649 | sand follow the letter R, for example: | sand follow the letter R, for example: |
4650 | ||
4651 | (?(R3)...) or (?(R&name)...) | (?(R3)...) or (?(R&name)...) |
4652 | ||
4653 | the condition is true if the most recent recursion is into the subpat- | the condition is true if the most recent recursion is into the subpat- |
4654 | tern whose number or name is given. This condition does not check the | tern whose number or name is given. This condition does not check the |
4655 | entire recursion stack. | entire recursion stack. |
4656 | ||
4657 | At "top level", all these recursion test conditions are false. Recur- | At "top level", all these recursion test conditions are false. Recur- |
4658 | sive patterns are described below. | sive patterns are described below. |
4659 | ||
4660 | Defining subpatterns for use by reference only | Defining subpatterns for use by reference only |
4661 | ||
4662 | If the condition is the string (DEFINE), and there is no subpattern | If the condition is the string (DEFINE), and there is no subpattern |
4663 | with the name DEFINE, the condition is always false. In this case, | with the name DEFINE, the condition is always false. In this case, |
4664 | there may be only one alternative in the subpattern. It is always | there may be only one alternative in the subpattern. It is always |
4665 | skipped if control reaches this point in the pattern; the idea of | skipped if control reaches this point in the pattern; the idea of |
4666 | DEFINE is that it can be used to define "subroutines" that can be ref- | DEFINE is that it can be used to define "subroutines" that can be ref- |
4667 | erenced from elsewhere. (The use of "subroutines" is described below.) | erenced from elsewhere. (The use of "subroutines" is described below.) |
4668 | For example, a pattern to match an IPv4 address could be written like | For example, a pattern to match an IPv4 address could be written like |
4669 | this (ignore whitespace and line breaks): | this (ignore whitespace and line breaks): |
4670 | ||
4671 | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
4672 | \b (?&byte) (\.(?&byte)){3} \b | \b (?&byte) (\.(?&byte)){3} \b |
4673 | ||
4674 | The first part of the pattern is a DEFINE group inside which another | The first part of the pattern is a DEFINE group inside which another |
4675 | group named "byte" is defined. This matches an individual component of | group named "byte" is defined. This matches an individual component of |
4676 | an IPv4 address (a number less than 256). When matching takes place, | an IPv4 address (a number less than 256). When matching takes place, |
4677 | this part of the pattern is skipped because DEFINE acts like a false | this part of the pattern is skipped because DEFINE acts like a false |
4678 | condition. | condition. |
4679 | ||
4680 | The rest of the pattern uses references to the named group to match the | The rest of the pattern uses references to the named group to match the |
4681 | four dot-separated components of an IPv4 address, insisting on a word | four dot-separated components of an IPv4 address, insisting on a word |
4682 | boundary at each end. | boundary at each end. |
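
The same "define once, reference several times" idea can be approximated in an engine without DEFINE. The sketch below uses Python's re module, which has no (?(DEFINE)...) group, so an ordinary string named BYTE (a hypothetical name) is substituted where PCRE would use (?&byte). It illustrates the structure of the pattern, not how PCRE implements subroutine calls.

```python
import re

# Assumption: Python's re has no DEFINE groups, so the "byte" subpattern
# is kept in a plain string and interpolated where PCRE uses (?&byte).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One component, then three more components each preceded by a dot,
# with a word boundary required at each end.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for subject in ("192.168.0.1", "10.0.0.256", "1.2.3"):
    print(subject, ipv4.fullmatch(subject) is not None)
```

String interpolation copies the text of the subpattern into each place it is used, whereas a PCRE reference calls the one defined group, so the compiled patterns differ even though they accept the same strings here.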
4683 | ||
4684 | Assertion conditions | Assertion conditions |
4685 | ||
4686 | If the condition is not in any of the above formats, it must be an | If the condition is not in any of the above formats, it must be an |
4687 | assertion. This may be a positive or negative lookahead or lookbehind | assertion. This may be a positive or negative lookahead or lookbehind |
4688 | assertion. Consider this pattern, again containing non-significant | assertion. Consider this pattern, again containing non-significant |
4689 | white space, and with the two alternatives on the second line: | white space, and with the two alternatives on the second line: |
4690 | ||
4691 | (?(?=[^a-z]*[a-z]) | (?(?=[^a-z]*[a-z]) |
4692 | \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) | \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
4693 | ||
4694 | The condition is a positive lookahead assertion that matches an | The condition is a positive lookahead assertion that matches an |
4695 | optional sequence of non-letters followed by a letter. In other words, | optional sequence of non-letters followed by a letter. In other words, |
4696 | it tests for the presence of at least one letter in the subject. If a | it tests for the presence of at least one letter in the subject. If a |
4697 | letter is found, the subject is matched against the first alternative; | letter is found, the subject is matched against the first alternative; |
4698 | otherwise it is matched against the second. This pattern matches | otherwise it is matched against the second. This pattern matches |
4699 | strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are | strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
4700 | letters and dd are digits. | letters and dd are digits. |
4701 | ||
4702 | ||
4703 | COMMENTS | COMMENTS |
4704 | ||
4705 | The sequence (?# marks the start of a comment that continues up to the | The sequence (?# marks the start of a comment that continues up to the |
4706 | next closing parenthesis. Nested parentheses are not permitted. The | next closing parenthesis. Nested parentheses are not permitted. The |
4707 | characters that make up a comment play no part in the pattern matching | characters that make up a comment play no part in the pattern matching |
4708 | at all. | at all. |
4709 | ||
4710 | If the PCRE_EXTENDED option is set, an unescaped # character outside a | If the PCRE_EXTENDED option is set, an unescaped # character outside a |
4711 | character class introduces a comment that continues to immediately | character class introduces a comment that continues to immediately |
4712 | after the next newline in the pattern. | after the next newline in the pattern. |
4713 | ||
4714 | ||
4715 | RECURSIVE PATTERNS | RECURSIVE PATTERNS |
4716 | ||
4717 | Consider the problem of matching a string in parentheses, allowing for | Consider the problem of matching a string in parentheses, allowing for |
4718 | unlimited nested parentheses. Without the use of recursion, the best | unlimited nested parentheses. Without the use of recursion, the best |
4719 | that can be done is to use a pattern that matches up to some fixed | that can be done is to use a pattern that matches up to some fixed |
4720 | depth of nesting. It is not possible to handle an arbitrary nesting | depth of nesting. It is not possible to handle an arbitrary nesting |
4721 | depth. | depth. |
4722 | ||
4723 | For some time, Perl has provided a facility that allows regular expres- | For some time, Perl has provided a facility that allows regular expres- |
4724 | sions to recurse (amongst other things). It does this by interpolating | sions to recurse (amongst other things). It does this by interpolating |
4725 | Perl code in the expression at run time, and the code can refer to the | Perl code in the expression at run time, and the code can refer to the |
4726 | expression itself. A Perl pattern using code interpolation to solve the | expression itself. A Perl pattern using code interpolation to solve the |
4727 | parentheses problem can be created like this: | parentheses problem can be created like this: |
4728 | ||
# | Line 4647 RECURSIVE PATTERNS | Line 4732 RECURSIVE PATTERNS |
4732 | refers recursively to the pattern in which it appears. | refers recursively to the pattern in which it appears. |
4733 | ||
4734 | Obviously, PCRE cannot support the interpolation of Perl code. Instead, | Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
4735 | it supports special syntax for recursion of the entire pattern, and | it supports special syntax for recursion of the entire pattern, and |
4736 | also for individual subpattern recursion. After its introduction in | also for individual subpattern recursion. After its introduction in |
4737 | PCRE and Python, this kind of recursion was introduced into Perl at | PCRE and Python, this kind of recursion was subsequently introduced |
4738 | release 5.10. | into Perl at release 5.10. |
4739 | ||
4740 | A special item that consists of (? followed by a number greater than | A special item that consists of (? followed by a number greater than |
4741 | zero and a closing parenthesis is a recursive call of the subpattern of | zero and a closing parenthesis is a recursive call of the subpattern of |
4742 | the given number, provided that it occurs inside that subpattern. (If | the given number, provided that it occurs inside that subpattern. (If |
4743 | not, it is a "subroutine" call, which is described in the next sec- | not, it is a "subroutine" call, which is described in the next sec- |
4744 | tion.) The special item (?R) or (?0) is a recursive call of the entire | tion.) The special item (?R) or (?0) is a recursive call of the entire |
4745 | regular expression. | regular expression. |
4746 | ||
In PCRE (like Python, but unlike Perl), a recursive subpattern call is | ||
always treated as an atomic group. That is, once it has matched some of | ||
the subject string, it is never re-entered, even if it contains untried | ||
alternatives and there is a subsequent matching failure. | ||
4747 | This PCRE pattern solves the nested parentheses problem (assume the | This PCRE pattern solves the nested parentheses problem (assume the |
4748 | PCRE_EXTENDED option is set so that white space is ignored): | PCRE_EXTENDED option is set so that white space is ignored): |
4749 | ||
# | Line 4752 RECURSIVE PATTERNS | Line 4832 RECURSIVE PATTERNS |
4832 | two different alternatives for the recursive and non-recursive cases. | two different alternatives for the recursive and non-recursive cases. |
4833 | The (?R) item is the actual recursive call. | The (?R) item is the actual recursive call. |
4834 | ||
4835 | Recursion difference from Perl | |
4836 | ||
4837 | In PCRE (like Python, but unlike Perl), a recursive subpattern call is | |
4838 | always treated as an atomic group. That is, once it has matched some of | |
4839 | the subject string, it is never re-entered, even if it contains untried | |
4840 | alternatives and there is a subsequent matching failure. This can be | |
4841 | illustrated by the following pattern, which purports to match a palin- | |
4842 | dromic string that contains an odd number of characters (for example, | |
4843 | "a", "aba", "abcba", "abcdcba"): | |
4844 | ||
4845 | ^(.|(.)(?1)\2)$ | |
4846 | ||
4847 | The idea is that it either matches a single character, or two identical | |
4848 | characters surrounding a sub-palindrome. In Perl, this pattern works; | |
4849 | in PCRE it does not if the subject string is longer than three characters. | |
4850 | Consider the subject string "abcba": | |
4851 | ||
4852 | At the top level, the first character is matched, but as it is not at | |
4853 | the end of the string, the first alternative fails; the second alterna- | |
4854 | tive is taken and the recursion kicks in. The recursive call to subpat- | |
4855 | tern 1 successfully matches the next character ("b"). (Note that the | |
4856 | beginning and end of line tests are not part of the recursion.) | |
4857 | ||
4858 | Back at the top level, the next character ("c") is compared with what | |
4859 | subpattern 2 matched, which was "a". This fails. Because the recursion | |
4860 | is treated as an atomic group, there are now no backtracking points, | |
4861 | and so the entire match fails. (Perl is able, at this point, to re- | |
4862 | enter the recursion and try the second alternative.) However, if the | |
4863 | pattern is written with the alternatives in the other order, things are | |
4864 | different: | |
4865 | ||
4866 | ^((.)(?1)\2|.)$ | |
4867 | ||
4868 | This time, the recursing alternative is tried first, and continues to | |
4869 | recurse until it runs out of characters, at which point the recursion | |
4870 | fails. But this time we do have another alternative to try at the | |
4871 | higher level. That is the big difference: in the previous case the | |
4872 | remaining alternative is at a deeper recursion level, which PCRE cannot | |
4873 | use. | |
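
The difference between the two orderings can be reproduced without any regex engine. The following Python sketch is a hand-written model of group 1 from the two palindrome patterns above, not PCRE code; the names group_ends and matches and the flags recursive_first and atomic are invented for this illustration. It assumes that an atomic recursion keeps only the first way the recursive call matched, and that the caller's own captured character is what \2 refers to after the call returns.

```python
def group_ends(s, i, recursive_first, atomic):
    """Yield every end offset at which group 1 can match when started at offset i."""

    def single():
        # Alternative ".": consume exactly one character.
        if i < len(s):
            yield i + 1

    def wrapped():
        # Alternative "(.)(?1)\2": one character, a recursive call,
        # then the same character again.
        if i >= len(s):
            return
        inner = group_ends(s, i + 1, recursive_first, atomic)
        if atomic:
            # Atomic recursion (the PCRE behaviour described above):
            # only the first successful match of the call is kept.
            first = next(inner, None)
            inner = [] if first is None else [first]
        for j in inner:
            if j < len(s) and s[j] == s[i]:
                yield j + 1

    order = (wrapped, single) if recursive_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, recursive_first, atomic):
    # ^ and $ belong to the top level, which is not a recursive call and
    # can therefore still backtrack into its own group.
    return any(end == len(s) for end in group_ends(s, 0, recursive_first, atomic))


print(matches("abcba", recursive_first=False, atomic=False))  # True  (Perl-like)
print(matches("abcba", recursive_first=False, atomic=True))   # False (atomic recursion)
print(matches("abcba", recursive_first=True, atomic=True))    # True  (alternatives swapped)
```

With the single-character alternative first, the atomic model still accepts "a" and "aba", which agrees with the statement that the failure appears only for longer subjects.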
4874 | ||
4875 | To change the pattern so that it matches all palindromic strings, not just | |
4876 | those with an odd number of characters, it is tempting to change the | |
4877 | pattern to this: | |
4878 | ||
4879 | ^((.)(?1)\2|.?)$ | |
4880 | ||
4881 | Again, this works in Perl, but not in PCRE, and for the same reason. | |
4882 | When a deeper recursion has matched a single character, it cannot be | |
4883 | entered again in order to match an empty string. The solution is to | |
4884 | separate the two cases, and write out the odd and even cases as alter- | |
4885 | natives at the higher level: | |
4886 | ||
4887 | ^(?:((.)(?1)\2|)|((.)(?3)\4|.)) | |
4888 | ||
4889 | If you want to match typical palindromic phrases, the pattern has to | |
4890 | ignore all non-word characters, which can be done like this: | |
4891 | ||
4892 | ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ | |
4893 | ||
4894 | If run with the PCRE_CASELESS option, this pattern matches phrases such | |
4895 | as "A man, a plan, a canal: Panama!" and it works well in both PCRE and | |
4896 | Perl. Note the use of the possessive quantifier *+ to avoid backtrack- | |
4897 | ing into sequences of non-word characters. Without this, PCRE takes a | |
4898 | great deal longer (ten times or more) to match typical phrases, and | |
4899 | Perl takes so long that you think it has gone into a loop. | |
4900 | ||
4901 | ||
4902 | SUBPATTERNS AS SUBROUTINES | SUBPATTERNS AS SUBROUTINES |
4903 | ||
# | Line 4864 BACKTRACKING CONTROL | Line 5010 BACKTRACKING CONTROL |
5010 | (*FAIL), which behaves like a failing negative assertion, they cause an | (*FAIL), which behaves like a failing negative assertion, they cause an |
5011 | error if encountered by pcre_dfa_exec(). | error if encountered by pcre_dfa_exec(). |
5012 | ||
5013 | The new verbs make use of what was previously invalid syntax: an open- | If any of these verbs are used in an assertion subpattern, their effect |
5014 | is confined to that subpattern; it does not extend to the surrounding | |
5015 | pattern. Note that assertion subpatterns are processed as anchored at | |
5016 | the point where they are tested. | |
5017 | ||
5018 | The new verbs make use of what was previously invalid syntax: an open- | |
5019 | ing parenthesis followed by an asterisk. In Perl, they are generally of | ing parenthesis followed by an asterisk. In Perl, they are generally of |
5020 | the form (*VERB:ARG) but PCRE does not support the use of arguments, so | the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
5021 | its general form is just (*VERB). Any number of these verbs may occur | its general form is just (*VERB). Any number of these verbs may occur |
5022 | in a pattern. There are two kinds: | in a pattern. There are two kinds: |
5023 | ||
5024 | Verbs that act immediately | Verbs that act immediately |
# | Line 4876 BACKTRACKING CONTROL | Line 5027 BACKTRACKING CONTROL |
5027 | ||
5028 | (*ACCEPT) | (*ACCEPT) |
5029 | ||
5030 | This verb causes the match to end successfully, skipping the remainder | This verb causes the match to end successfully, skipping the remainder |
5031 | of the pattern. When inside a recursion, only the innermost pattern is | of the pattern. When inside a recursion, only the innermost pattern is |
5032 | ended immediately. PCRE differs from Perl in what happens if the | ended immediately. If the (*ACCEPT) is inside capturing parentheses, |
5033 | (*ACCEPT) is inside capturing parentheses. In Perl, the data so far is | the data so far is captured. (This feature was added to PCRE at release |
5034 | captured: in PCRE no data is captured. For example: | 8.00.) For example: |
5035 | ||
5036 | A(A|B(*ACCEPT)|C)D | A((?:A|B(*ACCEPT)|C)D) |
5037 | ||
5038 | This matches "AB", "AAD", or "ACD", but when it matches "AB", no data | This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
5039 | is captured. | tured by the outer parentheses. |
5040 | ||
5041 | (*FAIL) or (*F) | (*FAIL) or (*F) |
5042 | ||
5043 | This verb causes the match to fail, forcing backtracking to occur. It | This verb causes the match to fail, forcing backtracking to occur. It |
5044 | is equivalent to (?!) but easier to read. The Perl documentation notes | is equivalent to (?!) but easier to read. The Perl documentation notes |
5045 | that it is probably useful only when combined with (?{}) or (??{}). | that it is probably useful only when combined with (?{}) or (??{}). |
5046 | Those are, of course, Perl features that are not present in PCRE. The | Those are, of course, Perl features that are not present in PCRE. The |
5047 | nearest equivalent is the callout feature, as for example in this pat- | nearest equivalent is the callout feature, as for example in this pat- |
5048 | tern: | tern: |
5049 | ||
5050 | a+(?C)(*FAIL) | a+(?C)(*FAIL) |
5051 | ||
5052 | A match with the string "aaaa" always fails, but the callout is taken | A match with the string "aaaa" always fails, but the callout is taken |
5053 | before each backtrack happens (in this example, 10 times). | before each backtrack happens (in this example, 10 times). |
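
The figure of 10 can be checked by counting. The Python sketch below is an assumed model rather than PCRE itself: the function name callout_count is invented here, and the model supposes that the matcher tries every starting position with no start-of-match optimization, and that a+ gives back one character per backtrack.

```python
def callout_count(subject):
    """Count callouts for a+(?C)(*FAIL) under the simple model described above."""
    count = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ is tried at lengths run, run-1, ..., 1; each reaches the callout once
        # before (*FAIL) forces the next backtrack.
        count += run
    return count


print(callout_count("aaaa"))  # 4 + 3 + 2 + 1 = 10
```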
5054 | ||
5055 | Verbs that act after backtracking | Verbs that act after backtracking |
5056 | ||
5057 | The following verbs do nothing when they are encountered. Matching con- | The following verbs do nothing when they are encountered. Matching con- |
5058 | tinues with what follows, but if there is no subsequent match, a fail- | tinues with what follows, but if there is no subsequent match, a fail- |
5059 | ure is forced. The verbs differ in exactly what kind of failure | ure is forced. The verbs differ in exactly what kind of failure |
5060 | occurs. | occurs. |
5061 | ||
5062 | (*COMMIT) | (*COMMIT) |
5063 | ||
5064 | This verb causes the whole match to fail outright if the rest of the | This verb causes the whole match to fail outright if the rest of the |
5065 | pattern does not match. Even if the pattern is unanchored, no further | pattern does not match. Even if the pattern is unanchored, no further |
5066 | attempts to find a match by advancing the start point take place. Once | attempts to find a match by advancing the start point take place. Once |
5067 | (*COMMIT) has been passed, pcre_exec() is committed to finding a match | (*COMMIT) has been passed, pcre_exec() is committed to finding a match |
5068 | at the current starting point, or not at all. For example: | at the current starting point, or not at all. For example: |
5069 | ||
5070 | a+(*COMMIT)b | a+(*COMMIT)b |
5071 | ||
5072 | This matches "xxaab" but not "aacaab". It can be thought of as a kind | This matches "xxaab" but not "aacaab". It can be thought of as a kind |
5073 | of dynamic anchor, or "I've started, so I must finish." | of dynamic anchor, or "I've started, so I must finish." |
5074 | ||
5075 | (*PRUNE) | (*PRUNE) |
5076 | ||
5077 | This verb causes the match to fail at the current position if the rest | This verb causes the match to fail at the current position if the rest |
5078 | of the pattern does not match. If the pattern is unanchored, the normal | of the pattern does not match. If the pattern is unanchored, the normal |
5079 | "bumpalong" advance to the next starting character then happens. Back- | "bumpalong" advance to the next starting character then happens. Back- |
5080 | tracking can occur as usual to the left of (*PRUNE), or when matching | tracking can occur as usual to the left of (*PRUNE), or when matching |
5081 | to the right of (*PRUNE), but if there is no match to the right, back- | to the right of (*PRUNE), but if there is no match to the right, back- |
5082 | tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) | tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
5083 | is just an alternative to an atomic group or possessive quantifier, but | is just an alternative to an atomic group or possessive quantifier, but |
5084 | there are some uses of (*PRUNE) that cannot be expressed in any other | there are some uses of (*PRUNE) that cannot be expressed in any other |
5085 | way. | way. |
5086 | ||
5087 | (*SKIP) | (*SKIP) |
5088 | ||
5089 | This verb is like (*PRUNE), except that if the pattern is unanchored, | This verb is like (*PRUNE), except that if the pattern is unanchored, |
5090 | the "bumpalong" advance is not to the next character, but to the posi- | the "bumpalong" advance is not to the next character, but to the posi- |
5091 | tion in the subject where (*SKIP) was encountered. (*SKIP) signifies | tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
5092 | that whatever text was matched leading up to it cannot be part of a | that whatever text was matched leading up to it cannot be part of a |
5093 | successful match. Consider: | successful match. Consider: |
5094 | ||
5095 | a+(*SKIP)b | a+(*SKIP)b |
5096 | ||
5097 | If the subject is "aaaac...", after the first match attempt fails | If the subject is "aaaac...", after the first match attempt fails |
5098 | (starting at the first character in the string), the starting point | (starting at the first character in the string), the starting point |
5099 | skips on to start the next attempt at "c". Note that a possessive quan- | skips on to start the next attempt at "c". Note that a possessive quan- |
5100 | tifier does not have the same effect in this example; although it would | tifier does not have the same effect in this example; although it would |
5101 | suppress backtracking during the first match attempt, the second | suppress backtracking during the first match attempt, the second |
5102 | attempt would start at the second character instead of skipping on to | attempt would start at the second character instead of skipping on to |
5103 | "c". | "c". |
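
The effect of (*COMMIT), (*PRUNE) and (*SKIP) on the starting point can be seen by recording which start offsets are tried. The Python sketch below is a simulation written for this illustration, not a call into PCRE: the function search is hypothetical, it handles only the fixed pattern shape a+(*VERB)b, and it assumes a plain left-to-right "bumpalong" with no start-of-match optimization.

```python
def search(subject, verb=None):
    """Simulate unanchored matching of a+(*VERB)b.

    Returns (span, starts): span is the (start, end) of the match or None,
    and starts lists every start offset at which an attempt was made.
    verb is None, "COMMIT", "PRUNE" or "SKIP".
    """
    starts = []
    n = len(subject)
    start = 0
    while start < n:
        starts.append(start)
        # Greedy a+: consume as many "a" characters as possible.
        end = start
        while end < n and subject[end] == "a":
            end += 1
        if end > start:
            # Without a verb, a+ may give characters back one at a time.
            # With a verb, backtracking cannot cross it, so only the
            # longest run is tried.
            lowest = start + 1 if verb is None else end
            for pos in range(end, lowest - 1, -1):
                if pos < n and subject[pos] == "b":
                    return (start, pos + 1), starts
            if verb == "COMMIT":
                return None, starts   # no further start positions at all
            if verb == "SKIP":
                start = end           # resume where the verb was encountered
                continue
        start += 1                    # normal bumpalong (also used by PRUNE)
    return None, starts


print(search("aaaac", "PRUNE"))    # (None, [0, 1, 2, 3, 4])
print(search("aaaac", "SKIP"))     # (None, [0, 4])
print(search("aaaac", "COMMIT"))   # (None, [0])
print(search("xxaab", "COMMIT"))   # ((2, 5), [0, 1, 2])
print(search("aacaab", "COMMIT"))  # (None, [0])
```

In this model (*PRUNE) visits the same start offsets as the pattern without a verb; what it saves is the backtracking inside each attempt.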
5104 | ||
5105 | (*THEN) | (*THEN) |
5106 | ||
5107 | This verb causes a skip to the next alternation if the rest of the pat- | This verb causes a skip to the next alternation if the rest of the pat- |
5108 | tern does not match. That is, it cancels pending backtracking, but only | tern does not match. That is, it cancels pending backtracking, but only |
5109 | within the current alternation. Its name comes from the observation | within the current alternation. Its name comes from the observation |
5110 | that it can be used for a pattern-based if-then-else block: | that it can be used for a pattern-based if-then-else block: |
5111 | ||
5112 | ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... | ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
5113 | ||
5114 | If the COND1 pattern matches, FOO is tried (and possibly further items | If the COND1 pattern matches, FOO is tried (and possibly further items |
5115 | after the end of the group if FOO succeeds); on failure the matcher | after the end of the group if FOO succeeds); on failure the matcher |
5116 | skips to the second alternative and tries COND2, without backtracking | skips to the second alternative and tries COND2, without backtracking |
5117 | into COND1. If (*THEN) is used outside of any alternation, it acts | into COND1. If (*THEN) is used outside of any alternation, it acts |
5118 | exactly like (*PRUNE). | exactly like (*PRUNE). |
5119 | ||
5120 | ||
# | Line 4981 AUTHOR | Line 5132 AUTHOR |
5132 | ||
5133 | REVISION | REVISION |
5134 | ||
5135 | Last updated: 19 April 2008 | Last updated: 18 September 2009 |
5136 | Copyright (c) 1997-2008 University of Cambridge. | Copyright (c) 1997-2009 University of Cambridge. |
5137 | ------------------------------------------------------------------------------ | ------------------------------------------------------------------------------ |
5138 | ||
5139 | ||
5140 | PCRESYNTAX(3) PCRESYNTAX(3) | PCRESYNTAX(3) PCRESYNTAX(3) |
5141 | ||
5142 | ||
# | Line 5094 GENERAL CATEGORY PROPERTY CODES FOR \p a | Line 5245 GENERAL CATEGORY PROPERTY CODES FOR \p a |
5245 | SCRIPT NAMES FOR \p AND \P | SCRIPT NAMES FOR \p AND \P |
5246 | ||
5247 | Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, | Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
5248 | Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, | Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu- |
5249 | Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, | neiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, |
5250 | Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- | Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, |
5251 | gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, | Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi, |
5252 | Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, | Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam, |
5253 | Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, | Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian, |
5254 | Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, | Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurash- |
5255 | Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. | tra, Shavian, Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tag- |
5256 | banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, | |
5257 | Ugaritic, Vai, Yi. | |
5258 | ||
5259 | ||
5260 | CHARACTER CLASSES | CHARACTER CLASSES |
# | Line 5153 QUANTIFIERS | Line 5306 QUANTIFIERS |
5306 | ||
5307 | ANCHORS AND SIMPLE ASSERTIONS | ANCHORS AND SIMPLE ASSERTIONS |
5308 | ||
5309 | \b word boundary | \b word boundary (only ASCII letters recognized) |
5310 | \B not a word boundary | \B not a word boundary |
5311 | ^ start of subject | ^ start of subject |
5312 | also after internal newline in multiline mode | also after internal newline in multiline mode |
# | Line 5179 ALTERNATION | Line 5332 ALTERNATION |
5332 | ||
5333 | CAPTURING | CAPTURING |
5334 | ||
5335 | (...) capturing group | (...) capturing group |
5336 | (?<name>...) named capturing group (Perl) | (?<name>...) named capturing group (Perl) |
5337 | (?'name'...) named capturing group (Perl) | (?'name'...) named capturing group (Perl) |
5338 | (?P<name>...) named capturing group (Python) | (?P<name>...) named capturing group (Python) |
5339 | (?:...) non-capturing group | (?:...) non-capturing group |
5340 | (?|...) non-capturing group; reset group numbers for | (?|...) non-capturing group; reset group numbers for |
5341 | capturing groups in each alternative | capturing groups in each alternative |
5342 | ||
5343 | ||
5344 | ATOMIC GROUPS | ATOMIC GROUPS |
5345 | ||
5346 | (?>...) atomic, non-capturing group | (?>...) atomic, non-capturing group |
5347 | ||
5348 | ||
5349 | COMMENT | COMMENT |
5350 | ||
5351 | (?#....) comment (not nestable) | (?#....) comment (not nestable) |
5352 | ||
5353 | ||
5354 | OPTION SETTING | OPTION SETTING |
5355 | ||
5356 | (?i) caseless | (?i) caseless |
5357 | (?J) allow duplicate names | (?J) allow duplicate names |
5358 | (?m) multiline | (?m) multiline |
5359 | (?s) single line (dotall) | (?s) single line (dotall) |
5360 | (?U) default ungreedy (lazy) | (?U) default ungreedy (lazy) |
5361 | (?x) extended (ignore white space) | (?x) extended (ignore white space) |
5362 | (?-...) unset option(s) | (?-...) unset option(s) |
5363 | ||
5364 | The following is recognized only at the start of a pattern or after one | |
5365 | of the newline-setting options with similar syntax: | |
5366 | ||
5367 | (*UTF8) set UTF-8 mode | |
5368 | ||
5369 | ||
5370 | LOOKAHEAD AND LOOKBEHIND ASSERTIONS | LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
5371 | ||
5372 | (?=...) positive look ahead | (?=...) positive look ahead |
5373 | (?!...) negative look ahead | (?!...) negative look ahead |
5374 | (?<=...) positive look behind | (?<=...) positive look behind |
5375 | (?<!...) negative look behind | (?<!...) negative look behind |
5376 | ||
5377 | Each top-level branch of a look behind must be of a fixed length. | Each top-level branch of a look behind must be of a fixed length. |
5378 | ||
5379 | ||
5380 | BACKREFERENCES | BACKREFERENCES |
5381 | ||
5382 | \n reference by number (can be ambiguous) | \n reference by number (can be ambiguous) |
5383 | \gn reference by number | \gn reference by number |
5384 | \g{n} reference by number | \g{n} reference by number |
5385 | \g{-n} relative reference by number | \g{-n} relative reference by number |
5386 | \k<name> reference by name (Perl) | \k<name> reference by name (Perl) |
5387 | \k'name' reference by name (Perl) | \k'name' reference by name (Perl) |
5388 | \g{name} reference by name (Perl) | \g{name} reference by name (Perl) |
5389 | \k{name} reference by name (.NET) | \k{name} reference by name (.NET) |
5390 | (?P=name) reference by name (Python) | (?P=name) reference by name (Python) |
5391 | ||
5392 | ||
5393 | SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) | SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) |
5394 | ||
5395 | (?R) recurse whole pattern | (?R) recurse whole pattern |
5396 | (?n) call subpattern by absolute number | (?n) call subpattern by absolute number |
5397 | (?+n) call subpattern by relative number | (?+n) call subpattern by relative number |
5398 | (?-n) call subpattern by relative number | (?-n) call subpattern by relative number |
5399 | (?&name) call subpattern by name (Perl) | (?&name) call subpattern by name (Perl) |
5400 | (?P>name) call subpattern by name (Python) | (?P>name) call subpattern by name (Python) |
5401 | \g<name> call subpattern by name (Oniguruma) | \g<name> call subpattern by name (Oniguruma) |
5402 | \g'name' call subpattern by name (Oniguruma) | \g'name' call subpattern by name (Oniguruma) |
5403 | \g<n> call subpattern by absolute number (Oniguruma) | \g<n> call subpattern by absolute number (Oniguruma) |
5404 | \g'n' call subpattern by absolute number (Oniguruma) | \g'n' call subpattern by absolute number (Oniguruma) |
5405 | \g<+n> call subpattern by relative number (PCRE extension) | \g<+n> call subpattern by relative number (PCRE extension) |
5406 | \g'+n' call subpattern by relative number (PCRE extension) | \g'+n' call subpattern by relative number (PCRE extension) |
5407 | \g<-n> call subpattern by relative number (PCRE extension) | \g<-n> call subpattern by relative number (PCRE extension) |
5408 | \g'-n' call subpattern by relative number (PCRE extension) | \g'-n' call subpattern by relative number (PCRE extension) |
5409 | ||
5410 | ||
5411 | CONDITIONAL PATTERNS | CONDITIONAL PATTERNS |
# | Line 5255 CONDITIONAL PATTERNS | Line 5413 CONDITIONAL PATTERNS |
5413 | (?(condition)yes-pattern) | (?(condition)yes-pattern) |
5414 | (?(condition)yes-pattern|no-pattern) | (?(condition)yes-pattern|no-pattern) |
5415 | ||
5416 | (?(n)... absolute reference condition | (?(n)... absolute reference condition |
5417 | (?(+n)... relative reference condition | (?(+n)... relative reference condition |
5418 | (?(-n)... relative reference condition | (?(-n)... relative reference condition |
5419 | (?(<name>)... named reference condition (Perl) | (?(<name>)... named reference condition (Perl) |
5420 | (?('name')... named reference condition (Perl) | (?('name')... named reference condition (Perl) |
5421 | (?(name)... named reference condition (PCRE) | (?(name)... named reference condition (PCRE) |
5422 | (?(R)... overall recursion condition | (?(R)... overall recursion condition |
5423 | (?(Rn)... specific group recursion condition | (?(Rn)... specific group recursion condition |
5424 | (?(R&name)... specific recursion condition | (?(R&name)... specific recursion condition |
5425 | (?(DEFINE)... define subpattern for reference | (?(DEFINE)... define subpattern for reference |
5426 | (?(assert)... assertion condition | (?(assert)... assertion condition |
5427 | ||
5428 | ||
5429 | BACKTRACKING CONTROL | BACKTRACKING CONTROL |
5430 | ||
5431 | The following act immediately they are reached: | The following act immediately they are reached: |
5432 | ||
5433 | (*ACCEPT) force successful match | (*ACCEPT) force successful match |
5434 | (*FAIL) force backtrack; synonym (*F) | (*FAIL) force backtrack; synonym (*F) |
5435 | ||
5436 | The following act only when a subsequent match failure causes a back- | The following act only when a subsequent match failure causes a back- |
5437 | track to reach them. They all force a match failure, but they differ in | track to reach them. They all force a match failure, but they differ in |
5438 | what happens afterwards. Those that advance the start-of-match point do | what happens afterwards. Those that advance the start-of-match point do |
5439 | so only if the pattern is not anchored. | so only if the pattern is not anchored. |
5440 | ||
5441 | (*COMMIT) overall failure, no advance of starting point | (*COMMIT) overall failure, no advance of starting point |
5442 | (*PRUNE) advance to next starting character | (*PRUNE) advance to next starting character |
5443 | (*SKIP) advance start to current matching position | (*SKIP) advance start to current matching position |
5444 | (*THEN) local failure, backtrack to next alternation | (*THEN) local failure, backtrack to next alternation |
5445 | ||
5446 | ||
5447 | NEWLINE CONVENTIONS | NEWLINE CONVENTIONS |
5448 | ||
5449 | These are recognized only at the very start of the pattern or after a | These are recognized only at the very start of the pattern or after a |
5450 | (*BSR_...) option. | (*BSR_...) or (*UTF8) option. |
5451 |