137 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
138 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
139 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
140 |
for a decimal number. A full list is given in the pcrepattern documen- |
for a decimal number, the Unicode script names such as Arabic or Han, |
141 |
tation. The PCRE library is increased in size by about 90K when Unicode |
and the derived properties Any and L&. A full list is given in the |
142 |
property support is included. |
pcrepattern documentation. Only the short names for properties are sup- |
143 |
|
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
144 |
|
ter}, is not supported. Furthermore, in Perl, many properties may |
145 |
|
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
146 |
|
does not support this. |
147 |
|
|
148 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
149 |
|
|
159 |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
160 |
crash. |
crash. |
161 |
|
|
162 |
2. In a pattern, the escape sequence \x{...}, where the contents of the |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
163 |
braces is a string of hexadecimal digits, is interpreted as a UTF-8 |
two-byte UTF-8 character if the value is greater than 127. |
|
character whose code number is the given hexadecimal number, for exam- |
|
|
ple: \x{1234}. If a non-hexadecimal digit appears between the braces, |
|
|
the item is not recognized. This escape sequence can be used either as |
|
|
a literal, or within a character class. |
|
164 |
|
|
165 |
3. The original hexadecimal escape sequence, \xhh, matches a two-byte |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
|
UTF-8 character if the value is greater than 127. |
|
|
|
|
|
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
|
166 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
167 |
|
|
168 |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
169 |
gle byte. |
gle byte. |
170 |
|
|
171 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
172 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
173 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
174 |
|
|
175 |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
176 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
177 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
178 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
179 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
180 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
181 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
182 |
\p{Nd}. |
\p{Nd}. |
183 |
|
|
184 |
8. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
185 |
are all low-valued characters. |
are all low-valued characters. |
186 |
|
|
187 |
9. Case-insensitive matching applies only to characters whose values |
8. Case-insensitive matching applies only to characters whose values |
188 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
189 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
190 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
191 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
192 |
used only for characters with higher values. |
used only for characters with higher values. Even when Unicode property |
193 |
|
support is available, PCRE supports case-insensitive matching only when |
194 |
|
there is a one-to-one mapping between a letter's cases. There are a |
195 |
|
small number of many-to-one mappings in Unicode; these are not sup- |
196 |
|
ported by PCRE. |
197 |
|
|
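The byte counts mentioned in the comments above can be checked with a small UTF-8 encoder. This is a minimal sketch, not code taken from PCRE; the function name utf8_encode is hypothetical and is used only for illustration. It shows why a value such as 0xb3 (greater than 127) occupies two bytes in UTF-8 mode, and why a quantified item such as \x{100}{3} spans six bytes of subject rather than three.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Encodes one code point (0 to 0x10FFFF) as UTF-8 into out, which must
   have room for at least 4 bytes. Returns the number of bytes written,
   or 0 if cp is out of range. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 7 bits: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 11 bits: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 16 bits: three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                  /* 21 bits: four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* not a valid code point */
}
```

With this helper, utf8_encode(0xb3, buf) writes the two bytes 0xC2 0xB3, and utf8_encode(0x100, buf) writes 0xC4 0x80, so three repetitions of \x{100} correspond to six subject bytes.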
198 |
|
|
199 |
AUTHOR |
AUTHOR |
202 |
University Computing Service, |
University Computing Service, |
203 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
204 |
|
|
205 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
206 |
so I've taken it away. If you want to email me, use my initial and sur- |
so I've taken it away. If you want to email me, use my initial and sur- |
207 |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
208 |
|
|
209 |
Last updated: 07 March 2005 |
Last updated: 24 January 2006 |
210 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
211 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
212 |
|
|
213 |
|
|
809 |
internal matching function calls in a pcre_exec() execution. Further |
internal matching function calls in a pcre_exec() execution. Further |
810 |
details are given with pcre_exec() below. |
details are given with pcre_exec() below. |
811 |
|
|
812 |
|
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
813 |
|
|
814 |
|
The output is an integer that gives the default limit for the depth of |
815 |
|
recursion when calling the internal matching function in a pcre_exec() |
816 |
|
execution. Further details are given with pcre_exec() below. |
817 |
|
|
818 |
PCRE_CONFIG_STACKRECURSE |
PCRE_CONFIG_STACKRECURSE |
819 |
|
|
820 |
The output is an integer that is set to one if internal recursion when |
The output is an integer that is set to one if internal recursion when |
868 |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
869 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
870 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
871 |
sage. The offset from the start of the pattern to the character where |
sage. This is a static string that is part of the library. You must not |
872 |
the error was discovered is placed in the variable pointed to by |
try to free it. The offset from the start of the pattern to the charac- |
873 |
erroffset, which must not be NULL. If it is, an immediate error is |
ter where the error was discovered is placed in the variable pointed to |
874 |
|
by erroffset, which must not be NULL. If it is, an immediate error is |
875 |
given. |
given. |
876 |
|
|
877 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
878 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
879 |
via this argument in the event of an error. This is in addition to the |
via this argument in the event of an error. This is in addition to the |
880 |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
881 |
|
|
882 |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
883 |
character tables that are built when PCRE is compiled, using the |
character tables that are built when PCRE is compiled, using the |
884 |
default C locale. Otherwise, tableptr must be an address that is the |
default C locale. Otherwise, tableptr must be an address that is the |
885 |
result of a call to pcre_maketables(). This value is stored with the |
result of a call to pcre_maketables(). This value is stored with the |
886 |
compiled pattern, and used again by pcre_exec(), unless another table |
compiled pattern, and used again by pcre_exec(), unless another table |
887 |
pointer is passed to it. For more discussion, see the section on locale |
pointer is passed to it. For more discussion, see the section on locale |
888 |
support below. |
support below. |
889 |
|
|
890 |
This code fragment shows a typical straightforward call to pcre_com- |
This code fragment shows a typical straightforward call to pcre_com- |
891 |
pile(): |
pile(): |
892 |
|
|
893 |
pcre *re; |
pcre *re; |
900 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
901 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
902 |
|
|
903 |
The following names for option bits are defined in the pcre.h header |
The following names for option bits are defined in the pcre.h header |
904 |
file: |
file: |
905 |
|
|
906 |
PCRE_ANCHORED |
PCRE_ANCHORED |
907 |
|
|
908 |
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
909 |
is constrained to match only at the first matching point in the string |
is constrained to match only at the first matching point in the string |
910 |
that is being searched (the "subject string"). This effect can also be |
that is being searched (the "subject string"). This effect can also be |
911 |
achieved by appropriate constructs in the pattern itself, which is the |
achieved by appropriate constructs in the pattern itself, which is the |
912 |
only way to do it in Perl. |
only way to do it in Perl. |
913 |
|
|
914 |
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
915 |
|
|
916 |
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
917 |
all with number 255, before each pattern item. For discussion of the |
all with number 255, before each pattern item. For discussion of the |
918 |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
919 |
|
|
920 |
PCRE_CASELESS |
PCRE_CASELESS |
921 |
|
|
922 |
If this bit is set, letters in the pattern match both upper and lower |
If this bit is set, letters in the pattern match both upper and lower |
923 |
case letters. It is equivalent to Perl's /i option, and it can be |
case letters. It is equivalent to Perl's /i option, and it can be |
924 |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
925 |
always understands the concept of case for characters whose values are |
always understands the concept of case for characters whose values are |
926 |
less than 128, so caseless matching is always possible. For characters |
less than 128, so caseless matching is always possible. For characters |
927 |
with higher values, the concept of case is supported if PCRE is com- |
with higher values, the concept of case is supported if PCRE is com- |
928 |
piled with Unicode property support, but not otherwise. If you want to |
piled with Unicode property support, but not otherwise. If you want to |
929 |
use caseless matching for characters 128 and above, you must ensure |
use caseless matching for characters 128 and above, you must ensure |
930 |
that PCRE is compiled with Unicode property support as well as with |
that PCRE is compiled with Unicode property support as well as with |
931 |
UTF-8 support. |
UTF-8 support. |
932 |
|
|
933 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
934 |
|
|
935 |
If this bit is set, a dollar metacharacter in the pattern matches only |
If this bit is set, a dollar metacharacter in the pattern matches only |
936 |
at the end of the subject string. Without this option, a dollar also |
at the end of the subject string. Without this option, a dollar also |
937 |
matches immediately before the final character if it is a newline (but |
matches immediately before the final character if it is a newline (but |
938 |
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is |
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is |
939 |
ignored if PCRE_MULTILINE is set. There is no equivalent to this option |
ignored if PCRE_MULTILINE is set. There is no equivalent to this option |
940 |
in Perl, and no way to set it within a pattern. |
in Perl, and no way to set it within a pattern. |
941 |
|
|
942 |
PCRE_DOTALL |
PCRE_DOTALL |
943 |
|
|
944 |
If this bit is set, a dot metacharacter in the pattern matches all |
If this bit is set, a dot metacharacter in the pattern matches all |
945 |
characters, including newlines. Without it, newlines are excluded. This |
characters, including newlines. Without it, newlines are excluded. This |
946 |
option is equivalent to Perl's /s option, and it can be changed within |
option is equivalent to Perl's /s option, and it can be changed within |
947 |
a pattern by a (?s) option setting. A negative class such as [^a] |
a pattern by a (?s) option setting. A negative class such as [^a] |
948 |
always matches a newline character, independent of the setting of this |
always matches a newline character, independent of the setting of this |
949 |
option. |
option. |
950 |
|
|
951 |
PCRE_EXTENDED |
PCRE_EXTENDED |
952 |
|
|
953 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, whitespace data characters in the pattern are |
954 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White- |
955 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
956 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
957 |
line character, inclusive, are also ignored. This is equivalent to |
line character, inclusive, are also ignored. This is equivalent to |
958 |
Perl's /x option, and it can be changed within a pattern by a (?x) |
Perl's /x option, and it can be changed within a pattern by a (?x) |
959 |
option setting. |
option setting. |
960 |
|
|
961 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
962 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
963 |
Whitespace characters may never appear within special character |
Whitespace characters may never appear within special character |
964 |
sequences in a pattern, for example within the sequence (?( which |
sequences in a pattern, for example within the sequence (?( which |
965 |
introduces a conditional subpattern. |
introduces a conditional subpattern. |
966 |
|
|
967 |
PCRE_EXTRA |
PCRE_EXTRA |
968 |
|
|
969 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
970 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
971 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
972 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
973 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
974 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
975 |
literal. There are at present no other features controlled by this |
literal. There are at present no other features controlled by this |
976 |
option. It can also be set by a (?X) option setting within a pattern. |
option. It can also be set by a (?X) option setting within a pattern. |
977 |
|
|
978 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
979 |
|
|
980 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
981 |
before or at the first newline character in the subject string, though |
before or at the first newline character in the subject string, though |
982 |
the matched text may continue over the newline. |
the matched text may continue over the newline. |
983 |
|
|
984 |
PCRE_MULTILINE |
PCRE_MULTILINE |
985 |
|
|
986 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
987 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
988 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
989 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
990 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
991 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
992 |
|
|
993 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
994 |
constructs match immediately following or immediately before any new- |
constructs match immediately following or immediately before any new- |
995 |
line in the subject string, respectively, as well as at the very start |
line in the subject string, respectively, as well as at the very start |
996 |
and end. This is equivalent to Perl's /m option, and it can be changed |
and end. This is equivalent to Perl's /m option, and it can be changed |
997 |
within a pattern by a (?m) option setting. If there are no "\n" charac- |
within a pattern by a (?m) option setting. If there are no "\n" charac- |
998 |
ters in a subject string, or no occurrences of ^ or $ in a pattern, |
ters in a subject string, or no occurrences of ^ or $ in a pattern, |
999 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
1000 |
|
|
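The rules for where a dollar metacharacter may match, as described under PCRE_DOLLAR_ENDONLY and PCRE_MULTILINE, can be summarized in a few lines of C. This is only a model of the documented behaviour, not PCRE's implementation; the function name and the OPT_ flag values are hypothetical and are not the values defined in pcre.h.

```c
#include <stddef.h>

/* Hypothetical flags for this sketch only; not the pcre.h option bits. */
enum { OPT_DOLLAR_ENDONLY = 1, OPT_MULTILINE = 2 };

/* Returns 1 if a dollar assertion could succeed at offset pos in the
   subject s of length len under the given options, and 0 otherwise. */
int dollar_can_match(const char *s, size_t len, size_t pos, int opts)
{
    if (pos > len)
        return 0;                 /* outside the subject */
    if (pos == len)
        return 1;                 /* the very end always qualifies */
    if (s[pos] != '\n')
        return 0;                 /* otherwise only before a newline */
    if (opts & OPT_MULTILINE)
        return 1;                 /* before any newline; ENDONLY is ignored */
    if (opts & OPT_DOLLAR_ENDONLY)
        return 0;                 /* only the very end is allowed */
    return pos + 1 == len;        /* default: only before a final newline */
}
```

For the subject "ab\ncd\n", offset 5 (before the final newline) qualifies by default but not with OPT_DOLLAR_ENDONLY, while offset 2 (before the first newline) qualifies only with OPT_MULTILINE.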
1001 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
1002 |
|
|
1003 |
If this option is set, it disables the use of numbered capturing paren- |
If this option is set, it disables the use of numbered capturing paren- |
1004 |
theses in the pattern. Any opening parenthesis that is not followed by |
theses in the pattern. Any opening parenthesis that is not followed by |
1005 |
? behaves as if it were followed by ?: but named parentheses can still |
? behaves as if it were followed by ?: but named parentheses can still |
1006 |
be used for capturing (and they acquire numbers in the usual way). |
be used for capturing (and they acquire numbers in the usual way). |
1007 |
There is no equivalent of this option in Perl. |
There is no equivalent of this option in Perl. |
1008 |
|
|
1009 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
1010 |
|
|
1011 |
This option inverts the "greediness" of the quantifiers so that they |
This option inverts the "greediness" of the quantifiers so that they |
1012 |
are not greedy by default, but become greedy if followed by "?". It is |
are not greedy by default, but become greedy if followed by "?". It is |
1013 |
not compatible with Perl. It can also be set by a (?U) option setting |
not compatible with Perl. It can also be set by a (?U) option setting |
1014 |
within the pattern. |
within the pattern. |
1015 |
|
|
1016 |
PCRE_UTF8 |
PCRE_UTF8 |
1017 |
|
|
1018 |
This option causes PCRE to regard both the pattern and the subject as |
This option causes PCRE to regard both the pattern and the subject as |
1019 |
strings of UTF-8 characters instead of single-byte character strings. |
strings of UTF-8 characters instead of single-byte character strings. |
1020 |
However, it is available only when PCRE is built to include UTF-8 sup- |
However, it is available only when PCRE is built to include UTF-8 sup- |
1021 |
port. If not, the use of this option provokes an error. Details of how |
port. If not, the use of this option provokes an error. Details of how |
1022 |
this option changes the behaviour of PCRE are given in the section on |
this option changes the behaviour of PCRE are given in the section on |
1023 |
UTF-8 support in the main pcre page. |
UTF-8 support in the main pcre page. |
1024 |
|
|
1025 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1026 |
|
|
1027 |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
1028 |
automatically checked. If an invalid UTF-8 sequence of bytes is found, |
automatically checked. If an invalid UTF-8 sequence of bytes is found, |
1029 |
pcre_compile() returns an error. If you already know that your pattern |
pcre_compile() returns an error. If you already know that your pattern |
1030 |
is valid, and you want to skip this check for performance reasons, you |
is valid, and you want to skip this check for performance reasons, you |
1031 |
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
1032 |
passing an invalid UTF-8 string as a pattern is undefined. It may cause |
passing an invalid UTF-8 string as a pattern is undefined. It may cause |
1033 |
your program to crash. Note that this option can also be passed to |
your program to crash. Note that this option can also be passed to |
1034 |
pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check- |
pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check- |
1035 |
ing of subject strings. |
ing of subject strings. |
1036 |
|
|
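An application that sets PCRE_NO_UTF8_CHECK takes on the job of ensuring that its strings are valid. The following is a simplified sketch of such a check, assuming that a purely structural test is enough for the application; it is not the check that PCRE itself performs, and the function name is hypothetical. It verifies only that each lead byte is followed by the right number of continuation bytes, and does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if buf[0..len) is structurally well-formed UTF-8 (sequences
   of one to four bytes), and 0 otherwise. Overlong forms and surrogates
   are NOT detected by this simplified check. */
int utf8_structurally_valid(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i++];
        size_t extra;             /* continuation bytes expected */

        if (c < 0x80)
            extra = 0;
        else if ((c & 0xE0) == 0xC0)
            extra = 1;
        else if ((c & 0xF0) == 0xE0)
            extra = 2;
        else if ((c & 0xF8) == 0xF0)
            extra = 3;
        else
            return 0;             /* stray continuation or invalid lead byte */

        if (extra > len - i)
            return 0;             /* sequence is truncated */

        while (extra-- > 0) {
            if ((buf[i++] & 0xC0) != 0x80)
                return 0;         /* expected a 10xxxxxx byte */
        }
    }
    return 1;
}
```

A string that passes a check of this kind once can then be compiled or matched repeatedly with PCRE_NO_UTF8_CHECK set, which is the use the option is intended for.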
1037 |
|
|
1038 |
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
1039 |
|
|
1040 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
1041 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
1042 |
both compiling functions. |
both compiling functions. |
1043 |
|
|
1044 |
0 no error |
0 no error |
1096 |
pcre_extra *pcre_study(const pcre *code, int options, |
pcre_extra *pcre_study(const pcre *code, int options, |
1097 |
const char **errptr); |
const char **errptr); |
1098 |
|
|
1099 |
If a compiled pattern is going to be used several times, it is worth |
If a compiled pattern is going to be used several times, it is worth |
1100 |
spending more time analyzing it in order to speed up the time taken for |
spending more time analyzing it in order to speed up the time taken for |
1101 |
matching. The function pcre_study() takes a pointer to a compiled pat- |
matching. The function pcre_study() takes a pointer to a compiled pat- |
1102 |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
1103 |
information that will help speed up matching, pcre_study() returns a |
information that will help speed up matching, pcre_study() returns a |
1104 |
pointer to a pcre_extra block, in which the study_data field points to |
pointer to a pcre_extra block, in which the study_data field points to |
1105 |
the results of the study. |
the results of the study. |
1106 |
|
|
1107 |
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
1108 |
pcre_exec(). However, a pcre_extra block also contains other fields |
pcre_exec(). However, a pcre_extra block also contains other fields |
1109 |
that can be set by the caller before the block is passed; these are |
that can be set by the caller before the block is passed; these are |
1110 |
described below in the section on matching a pattern. |
described below in the section on matching a pattern. |
1111 |
|
|
1112 |
If studying the pattern does not produce any additional information |
If studying the pattern does not produce any additional information |
1113 |
pcre_study() returns NULL. In that circumstance, if the calling program |
pcre_study() returns NULL. In that circumstance, if the calling program |
1114 |
wants to pass any of the other fields to pcre_exec(), it must set up |
wants to pass any of the other fields to pcre_exec(), it must set up |
1115 |
its own pcre_extra block. |
its own pcre_extra block. |
1116 |
|
|
1117 |
The second argument of pcre_study() contains option bits. At present, |
The second argument of pcre_study() contains option bits. At present, |
1118 |
no options are defined, and this argument should always be zero. |
no options are defined, and this argument should always be zero. |
1119 |
|
|
1120 |
The third argument for pcre_study() is a pointer for an error message. |
The third argument for pcre_study() is a pointer for an error message. |
1121 |
If studying succeeds (even if no data is returned), the variable it |
If studying succeeds (even if no data is returned), the variable it |
1122 |
points to is set to NULL. Otherwise it points to a textual error mes- |
points to is set to NULL. Otherwise it is set to point to a textual |
1123 |
sage. You should therefore test the error pointer for NULL after call- |
error message. This is a static string that is part of the library. You |
1124 |
ing pcre_study(), to be sure that it has run successfully. |
must not try to free it. You should test the error pointer for NULL |
1125 |
|
after calling pcre_study(), to be sure that it has run successfully. |
1126 |
|
|
1127 |
This is a typical call to pcre_study(): |
This is a typical call to pcre_study(): |
1128 |
|
|
1144 |
by character value. When running in UTF-8 mode, this applies only to |
by character value. When running in UTF-8 mode, this applies only to |
1145 |
characters with codes less than 128. Higher-valued codes never match |
characters with codes less than 128. Higher-valued codes never match |
1146 |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
1147 |
with Unicode character property support. |
with Unicode character property support. The use of locales with Uni- |
1148 |
|
code is discouraged. |
1149 |
|
|
1150 |
An internal set of tables is created in the default C locale when PCRE |
An internal set of tables is created in the default C locale when PCRE |
1151 |
is built. This is used when the final argument of pcre_compile() is |
is built. This is used when the final argument of pcre_compile() is |
1152 |
NULL, and is sufficient for many applications. An alternative set of |
NULL, and is sufficient for many applications. An alternative set of |
1153 |
tables can, however, be supplied. These may be created in a different |
tables can, however, be supplied. These may be created in a different |
1154 |
locale from the default. As more and more applications change to using |
locale from the default. As more and more applications change to using |
1155 |
Unicode, the need for this locale support is expected to die away. |
Unicode, the need for this locale support is expected to die away. |
1156 |
|
|
1157 |
External tables are built by calling the pcre_maketables() function, |
External tables are built by calling the pcre_maketables() function, |
1158 |
which has no arguments, in the relevant locale. The result can then be |
which has no arguments, in the relevant locale. The result can then be |
1159 |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
1160 |
example, to build and use tables that are appropriate for the French |
example, to build and use tables that are appropriate for the French |
1161 |
locale (where accented characters with values greater than 128 are |
locale (where accented characters with values greater than 128 are |
1162 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
1163 |
|
|
1164 |
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
1165 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
1166 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
1167 |
|
|
1168 |
When pcre_maketables() runs, the tables are built in memory that is |
When pcre_maketables() runs, the tables are built in memory that is |
1169 |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
1170 |
that the memory containing the tables remains available for as long as |
that the memory containing the tables remains available for as long as |
1171 |
it is needed. |
it is needed. |
1172 |
|
|
1173 |
The pointer that is passed to pcre_compile() is saved with the compiled |
The pointer that is passed to pcre_compile() is saved with the compiled |
1174 |
pattern, and the same tables are used via this pointer by pcre_study() |
pattern, and the same tables are used via this pointer by pcre_study() |
1175 |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
1176 |
tern, compilation, studying and matching all happen in the same locale, |
tern, compilation, studying and matching all happen in the same locale, |
1177 |
but different patterns can be compiled in different locales. |
but different patterns can be compiled in different locales. |
1178 |
|
|
1179 |
It is possible to pass a table pointer or NULL (indicating the use of |
It is possible to pass a table pointer or NULL (indicating the use of |
1180 |
the internal tables) to pcre_exec(). Although not intended for this |
the internal tables) to pcre_exec(). Although not intended for this |
1181 |
purpose, this facility could be used to match a pattern in a different |
purpose, this facility could be used to match a pattern in a different |
1182 |
locale from the one in which it was compiled. Passing table pointers at |
locale from the one in which it was compiled. Passing table pointers at |
1183 |
run time is discussed below in the section on matching a pattern. |
run time is discussed below in the section on matching a pattern. |
1184 |
|
|
1188 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
1189 |
int what, void *where); |
int what, void *where); |
1190 |
|
|
1191 |
The pcre_fullinfo() function returns information about a compiled pat- |
The pcre_fullinfo() function returns information about a compiled pat- |
1192 |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
1193 |
less retained for backwards compatibility (and is documented below). |
less retained for backwards compatibility (and is documented below). |
1194 |
|
|
1195 |
The first argument for pcre_fullinfo() is a pointer to the compiled |
The first argument for pcre_fullinfo() is a pointer to the compiled |
1196 |
pattern. The second argument is the result of pcre_study(), or NULL if |
pattern. The second argument is the result of pcre_study(), or NULL if |
1197 |
the pattern was not studied. The third argument specifies which piece |
the pattern was not studied. The third argument specifies which piece |
1198 |
of information is required, and the fourth argument is a pointer to a |
of information is required, and the fourth argument is a pointer to a |
1199 |
variable to receive the data. The yield of the function is zero for |
variable to receive the data. The yield of the function is zero for |
1200 |
success, or one of the following negative numbers: |
success, or one of the following negative numbers: |
1201 |
|
|
1202 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
1204 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
1205 |
PCRE_ERROR_BADOPTION the value of what was invalid |
PCRE_ERROR_BADOPTION the value of what was invalid |
1206 |
|
|
1207 |
The "magic number" is placed at the start of each compiled pattern as |
The "magic number" is placed at the start of each compiled pattern as |
1208 |
a simple check against passing an arbitrary memory pointer. Here is a |
a simple check against passing an arbitrary memory pointer. Here is a |
1209 |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
1210 |
pattern: |
pattern: |
1211 |
|
|
1212 |
int rc; |
int rc; |
1217 |
PCRE_INFO_SIZE, /* what is required */ |
PCRE_INFO_SIZE, /* what is required */ |
1218 |
&length); /* where to put the data */ |
&length); /* where to put the data */ |
1219 |
|
|
1220 |
The possible values for the third argument are defined in pcre.h, and |
The possible values for the third argument are defined in pcre.h, and |
1221 |
are as follows: |
are as follows: |
1222 |
|
|
1223 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_BACKREFMAX |
1224 |
|
|
1225 |
Return the number of the highest back reference in the pattern. The |
Return the number of the highest back reference in the pattern. The |
1226 |
fourth argument should point to an int variable. Zero is returned if |
fourth argument should point to an int variable. Zero is returned if |
1227 |
there are no back references. |
there are no back references. |
1228 |
|
|
1229 |
PCRE_INFO_CAPTURECOUNT |
PCRE_INFO_CAPTURECOUNT |
1230 |
|
|
1231 |
Return the number of capturing subpatterns in the pattern. The fourth |
Return the number of capturing subpatterns in the pattern. The fourth |
1232 |
argument should point to an int variable. |
argument should point to an int variable. |
1233 |
|
|
1234 |
PCRE_INFO_DEFAULT_TABLES |
PCRE_INFO_DEFAULT_TABLES |
1235 |
|
|
1236 |
Return a pointer to the internal default character tables within PCRE. |
Return a pointer to the internal default character tables within PCRE. |
1237 |
The fourth argument should point to an unsigned char * variable. This |
The fourth argument should point to an unsigned char * variable. This |
1238 |
information call is provided for internal use by the pcre_study() func- |
information call is provided for internal use by the pcre_study() func- |
1239 |
tion. External callers can cause PCRE to use its internal tables by |
tion. External callers can cause PCRE to use its internal tables by |
1240 |
passing a NULL table pointer. |
passing a NULL table pointer. |
1241 |
|
|
1242 |
PCRE_INFO_FIRSTBYTE |
PCRE_INFO_FIRSTBYTE |
1243 |
|
|
1244 |
Return information about the first byte of any matched string, for a |
Return information about the first byte of any matched string, for a |
1245 |
non-anchored pattern. (This option used to be called |
non-anchored pattern. (This option used to be called |
1246 |
PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards |
PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards |
1247 |
compatibility.) |
compatibility.) |
1248 |
|
|
1249 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
1250 |
(cat|cow|coyote), it is returned in the integer pointed to by where. |
(cat|cow|coyote), it is returned in the integer pointed to by where. |
1251 |
Otherwise, if either |
Otherwise, if either |
1252 |
|
|
1253 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
1254 |
branch starts with "^", or |
branch starts with "^", or |
1255 |
|
|
1256 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
1257 |
set (if it were set, the pattern would be anchored), |
set (if it were set, the pattern would be anchored), |
1258 |
|
|
1259 |
-1 is returned, indicating that the pattern matches only at the start |
-1 is returned, indicating that the pattern matches only at the start |
1260 |
of a subject string or after any newline within the string. Otherwise |
of a subject string or after any newline within the string. Otherwise |
1261 |
-2 is returned. For anchored patterns, -2 is returned. |
-2 is returned. For anchored patterns, -2 is returned. |
1262 |
|
|
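The three kinds of result described above for PCRE_INFO_FIRSTBYTE can be
told apart with a small helper. This is only a sketch: the enum and the
function name classify_firstbyte are hypothetical and are not part of
PCRE. The argument is the int that pcre_fullinfo() stored via its fourth
argument.

```c
/* Hypothetical names, not part of the PCRE API. */
enum firstbyte_kind {
    FIRSTBYTE_FIXED,       /* value is the fixed first byte (0..255)   */
    FIRSTBYTE_LINE_START,  /* -1: matches only at start or after \n    */
    FIRSTBYTE_NONE         /* -2: no first-byte information, or the    */
                           /*     pattern is anchored                  */
};

/* Classify the integer stored by PCRE_INFO_FIRSTBYTE. */
enum firstbyte_kind classify_firstbyte(int value)
{
    if (value >= 0)
        return FIRSTBYTE_FIXED;
    if (value == -1)
        return FIRSTBYTE_LINE_START;
    return FIRSTBYTE_NONE;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be the
byte "c", which this helper reports as FIRSTBYTE_FIXED.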
1263 |
PCRE_INFO_FIRSTTABLE |
PCRE_INFO_FIRSTTABLE |
1264 |
|
|
1265 |
If the pattern was studied, and this resulted in the construction of a |
If the pattern was studied, and this resulted in the construction of a |
1266 |
256-bit table indicating a fixed set of bytes for the first byte in any |
256-bit table indicating a fixed set of bytes for the first byte in any |
1267 |
matching string, a pointer to the table is returned. Otherwise NULL is |
matching string, a pointer to the table is returned. Otherwise NULL is |
1268 |
returned. The fourth argument should point to an unsigned char * vari- |
returned. The fourth argument should point to an unsigned char * vari- |
1269 |
able. |
able. |
1270 |
|
|
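A 256-bit table occupies 32 bytes, one bit per possible byte value. The
following sketch tests whether a given byte is in the set. The helper
name is hypothetical, and the bit layout (bit c % 8 of byte c / 8 stands
for byte value c) is an assumption that should be checked against the
PCRE source for the release in use.

```c
/* Hypothetical helper, not part of the PCRE API.
   Assumed layout: bit (c % 8) of table[c / 8] represents byte value c.
   Returns nonzero if c is in the set of permitted first bytes. */
int table_allows_first_byte(const unsigned char *table, unsigned char c)
{
    return (table[c >> 3] & (1u << (c & 7))) != 0;
}
```

The table pointer is the unsigned char * that PCRE_INFO_FIRSTTABLE
stored; the helper must not be called when that pointer is NULL.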
1271 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
1272 |
|
|
1273 |
Return the value of the rightmost literal byte that must exist in any |
Return the value of the rightmost literal byte that must exist in any |
1274 |
matched string, other than at its start, if such a byte has been |
matched string, other than at its start, if such a byte has been |
1275 |
recorded. The fourth argument should point to an int variable. If there |
recorded. The fourth argument should point to an int variable. If there |
1276 |
is no such byte, -1 is returned. For anchored patterns, a last literal |
is no such byte, -1 is returned. For anchored patterns, a last literal |
1277 |
byte is recorded only if it follows something of variable length. For |
byte is recorded only if it follows something of variable length. For |
1278 |
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
1279 |
/^a\dz\d/ the returned value is -1. |
/^a\dz\d/ the returned value is -1. |
1280 |
|
|
1282 |
PCRE_INFO_NAMEENTRYSIZE |
PCRE_INFO_NAMEENTRYSIZE |
1283 |
PCRE_INFO_NAMETABLE |
PCRE_INFO_NAMETABLE |
1284 |
|
|
1285 |
PCRE supports the use of named as well as numbered capturing parenthe- |
PCRE supports the use of named as well as numbered capturing parenthe- |
1286 |
ses. The names are just an additional way of identifying the parenthe- |
ses. The names are just an additional way of identifying the parenthe- |
1287 |
ses, which still acquire numbers. A convenience function called |
ses, which still acquire numbers. A convenience function called |
1288 |
pcre_get_named_substring() is provided for extracting an individual |
pcre_get_named_substring() is provided for extracting an individual |
1289 |
captured substring by name. It is also possible to extract the data |
captured substring by name. It is also possible to extract the data |
1290 |
directly, by first converting the name to a number in order to access |
directly, by first converting the name to a number in order to access |
1291 |
the correct pointers in the output vector (described with pcre_exec() |
the correct pointers in the output vector (described with pcre_exec() |
1292 |
below). To do the conversion, you need to use the name-to-number map, |
below). To do the conversion, you need to use the name-to-number map, |
1293 |
which is described by these three values. |
which is described by these three values. |
1294 |
|
|
1295 |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
1296 |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
1297 |
of each entry; both of these return an int value. The entry size |
of each entry; both of these return an int value. The entry size |
1298 |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
1299 |
a pointer to the first entry of the table (a pointer to char). The |
a pointer to the first entry of the table (a pointer to char). The |
1300 |
first two bytes of each entry are the number of the capturing parenthe- |
first two bytes of each entry are the number of the capturing parenthe- |
1301 |
sis, most significant byte first. The rest of the entry is the corre- |
sis, most significant byte first. The rest of the entry is the corre- |
1302 |
sponding name, zero terminated. The names are in alphabetical order. |
sponding name, zero terminated. The names are in alphabetical order. |
1303 |
For example, consider the following pattern (assume PCRE_EXTENDED is |
For example, consider the following pattern (assume PCRE_EXTENDED is |
1304 |
set, so white space - including newlines - is ignored): |
set, so white space - including newlines - is ignored): |
1305 |
|
|
1306 |
(?P<date> (?P<year>(\d\d)?\d\d) - |
(?P<date> (?P<year>(\d\d)?\d\d) - |
1307 |
(?P<month>\d\d) - (?P<day>\d\d) ) |
(?P<month>\d\d) - (?P<day>\d\d) ) |
1308 |
|
|
1309 |
There are four named subpatterns, so the table has four entries, and |
There are four named subpatterns, so the table has four entries, and |
1310 |
each entry in the table is eight bytes long. The table is as follows, |
each entry in the table is eight bytes long. The table is as follows, |
1311 |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
1312 |
as ??: |
as ??: |
1313 |
|
|
1316 |
00 04 m o n t h 00 |
00 04 m o n t h 00 |
1317 |
00 02 y e a r 00 ?? |
00 02 y e a r 00 ?? |
1318 |
|
|
1319 |
When writing code to extract data from named subpatterns using the |
When writing code to extract data from named subpatterns using the |
1320 |
name-to-number map, remember that the length of each entry is likely to |
name-to-number map, remember that the length of each entry is likely to |
1321 |
be different for each compiled pattern. |
be different for each compiled pattern. |
1322 |
|
|
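A lookup in the name-to-number map might be sketched as follows. The
function name lookup_group_number is hypothetical. Its first three
arguments are the values obtained from PCRE_INFO_NAMETABLE,
PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE. The entry size is
taken from the argument rather than hard-coded, because it differs from
pattern to pattern.

```c
#include <string.h>

/* Hypothetical helper, not part of the PCRE API.
   Each entry is entry_size bytes: a two-byte group number, most
   significant byte first, followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

A linear scan keeps the sketch short. Because the names are stored in
alphabetical order, a binary search would also work for large tables.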
1323 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
1324 |
|
|
1325 |
Return a copy of the options with which the pattern was compiled. The |
Return a copy of the options with which the pattern was compiled. The |
1326 |
fourth argument should point to an unsigned long int variable. These |
fourth argument should point to an unsigned long int variable. These |
1327 |
option bits are those specified in the call to pcre_compile(), modified |
option bits are those specified in the call to pcre_compile(), modified |
1328 |
by any top-level option settings within the pattern itself. |
by any top-level option settings within the pattern itself. |
1329 |
|
|
1330 |
A pattern is automatically anchored by PCRE if all of its top-level |
A pattern is automatically anchored by PCRE if all of its top-level |
1331 |
alternatives begin with one of the following: |
alternatives begin with one of the following: |
1332 |
|
|
1333 |
^ unless PCRE_MULTILINE is set |
^ unless PCRE_MULTILINE is set |
1341 |
|
|
1342 |
PCRE_INFO_SIZE |
PCRE_INFO_SIZE |
1343 |
|
|
1344 |
Return the size of the compiled pattern, that is, the value that was |
Return the size of the compiled pattern, that is, the value that was |
1345 |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
1346 |
which to place the compiled data. The fourth argument should point to a |
which to place the compiled data. The fourth argument should point to a |
1347 |
size_t variable. |
size_t variable. |
1349 |
PCRE_INFO_STUDYSIZE |
PCRE_INFO_STUDYSIZE |
1350 |
|
|
1351 |
Return the size of the data block pointed to by the study_data field in |
Return the size of the data block pointed to by the study_data field in |
1352 |
a pcre_extra block. That is, it is the value that was passed to |
a pcre_extra block. That is, it is the value that was passed to |
1353 |
pcre_malloc() when PCRE was getting memory into which to place the data |
pcre_malloc() when PCRE was getting memory into which to place the data |
1354 |
created by pcre_study(). The fourth argument should point to a size_t |
created by pcre_study(). The fourth argument should point to a size_t |
1355 |
variable. |
variable. |
1356 |
|
|
1357 |
|
|
1359 |
|
|
1360 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
1361 |
|
|
1362 |
The pcre_info() function is now obsolete because its interface is too |
The pcre_info() function is now obsolete because its interface is too |
1363 |
restrictive to return all the available data about a compiled pattern. |
restrictive to return all the available data about a compiled pattern. |
1364 |
New programs should use pcre_fullinfo() instead. The yield of |
New programs should use pcre_fullinfo() instead. The yield of |
1365 |
pcre_info() is the number of capturing subpatterns, or one of the fol- |
pcre_info() is the number of capturing subpatterns, or one of the fol- |
1366 |
lowing negative numbers: |
lowing negative numbers: |
1367 |
|
|
1368 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
1369 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
1370 |
|
|
1371 |
If the optptr argument is not NULL, a copy of the options with which |
If the optptr argument is not NULL, a copy of the options with which |
1372 |
the pattern was compiled is placed in the integer it points to (see |
the pattern was compiled is placed in the integer it points to (see |
1373 |
PCRE_INFO_OPTIONS above). |
PCRE_INFO_OPTIONS above). |
1374 |
|
|
1375 |
If the pattern is not anchored and the firstcharptr argument is not |
If the pattern is not anchored and the firstcharptr argument is not |
1376 |
NULL, it is used to pass back information about the first character of |
NULL, it is used to pass back information about the first character of |
1377 |
any matched string (see PCRE_INFO_FIRSTBYTE above). |
any matched string (see PCRE_INFO_FIRSTBYTE above). |
1378 |
|
|
1379 |
|
|
1381 |
|
|
1382 |
int pcre_refcount(pcre *code, int adjust); |
int pcre_refcount(pcre *code, int adjust); |
1383 |
|
|
1384 |
The pcre_refcount() function is used to maintain a reference count in |
The pcre_refcount() function is used to maintain a reference count in |
1385 |
the data block that contains a compiled pattern. It is provided for the |
the data block that contains a compiled pattern. It is provided for the |
1386 |
benefit of applications that operate in an object-oriented manner, |
benefit of applications that operate in an object-oriented manner, |
1387 |
where different parts of the application may be using the same compiled |
where different parts of the application may be using the same compiled |
1388 |
pattern, but you want to free the block when they are all done. |
pattern, but you want to free the block when they are all done. |
1389 |
|
|
1390 |
When a pattern is compiled, the reference count field is initialized to |
When a pattern is compiled, the reference count field is initialized to |
1391 |
zero. It is changed only by calling this function, whose action is to |
zero. It is changed only by calling this function, whose action is to |
1392 |
add the adjust value (which may be positive or negative) to it. The |
add the adjust value (which may be positive or negative) to it. The |
1393 |
yield of the function is the new value. However, the value of the count |
yield of the function is the new value. However, the value of the count |
1394 |
is constrained to lie between 0 and 65535, inclusive. If the new value |
is constrained to lie between 0 and 65535, inclusive. If the new value |
1395 |
is outside these limits, it is forced to the appropriate limit value. |
is outside these limits, it is forced to the appropriate limit value. |
1396 |
|
|
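The arithmetic described above can be modelled in a few lines. This is a
sketch of the documented behaviour only, not PCRE's own code, and the
function name adjusted_refcount is hypothetical.

```c
/* Model of the documented behaviour; not taken from the PCRE source.
   Adds adjust to current and forces the result into 0..65535. */
int adjusted_refcount(int current, int adjust)
{
    long long next = (long long)current + adjust;  /* avoid int overflow */
    if (next < 0)
        next = 0;
    if (next > 65535)
        next = 65535;
    return (int)next;
}
```

One consequence is that a count that has been forced to 65535 no longer
records how many increments were requested, so later decrements can
reach zero while users of the pattern remain.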
1397 |
Except when it is zero, the reference count is not correctly preserved |
Except when it is zero, the reference count is not correctly preserved |
1398 |
if a pattern is compiled on one host and then transferred to a host |
if a pattern is compiled on one host and then transferred to a host |
1399 |
whose byte-order is different. (This seems a highly unlikely scenario.) |
whose byte-order is different. (This seems a highly unlikely scenario.) |
1400 |
|
|
1401 |
|
|
1405 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
1406 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
1407 |
|
|
1408 |
The function pcre_exec() is called to match a subject string against a |
The function pcre_exec() is called to match a subject string against a |
1409 |
compiled pattern, which is passed in the code argument. If the pattern |
compiled pattern, which is passed in the code argument. If the pattern |
1410 |
has been studied, the result of the study should be passed in the extra |
has been studied, the result of the study should be passed in the extra |
1411 |
argument. This function is the main matching facility of the library, |
argument. This function is the main matching facility of the library, |
1412 |
and it operates in a Perl-like manner. For specialist use there is also |
and it operates in a Perl-like manner. For specialist use there is also |
1413 |
an alternative matching function, which is described below in the sec- |
an alternative matching function, which is described below in the sec- |
1414 |
tion about the pcre_dfa_exec() function. |
tion about the pcre_dfa_exec() function. |
1415 |
|
|
1416 |
In most applications, the pattern will have been compiled (and option- |
In most applications, the pattern will have been compiled (and option- |
1417 |
ally studied) in the same process that calls pcre_exec(). However, it |
ally studied) in the same process that calls pcre_exec(). However, it |
1418 |
is possible to save compiled patterns and study data, and then use them |
is possible to save compiled patterns and study data, and then use them |
1419 |
later in different processes, possibly even on different hosts. For a |
later in different processes, possibly even on different hosts. For a |
1420 |
discussion about this, see the pcreprecompile documentation. |
discussion about this, see the pcreprecompile documentation. |
1421 |
|
|
1422 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
1435 |
|
|
1436 |
Extra data for pcre_exec() |
Extra data for pcre_exec() |
1437 |
|
|
1438 |
If the extra argument is not NULL, it must point to a pcre_extra data |
If the extra argument is not NULL, it must point to a pcre_extra data |
1439 |
block. The pcre_study() function returns such a block (when it doesn't |
block. The pcre_study() function returns such a block (when it doesn't |
1440 |
return NULL), but you can also create one for yourself, and pass addi- |
return NULL), but you can also create one for yourself, and pass addi- |
1441 |
tional information in it. The fields in a pcre_extra block are as fol- |
tional information in it. The pcre_extra block contains the following |
1442 |
lows: |
fields (not necessarily in this order): |
1443 |
|
|
1444 |
unsigned long int flags; |
unsigned long int flags; |
1445 |
void *study_data; |
void *study_data; |
1446 |
unsigned long int match_limit; |
unsigned long int match_limit; |
1447 |
|
unsigned long int match_limit_recursion; |
1448 |
void *callout_data; |
void *callout_data; |
1449 |
const unsigned char *tables; |
const unsigned char *tables; |
1450 |
|
|
1451 |
The flags field is a bitmap that specifies which of the other fields |
The flags field is a bitmap that specifies which of the other fields |
1452 |
are set. The flag bits are: |
are set. The flag bits are: |
1453 |
|
|
1454 |
PCRE_EXTRA_STUDY_DATA |
PCRE_EXTRA_STUDY_DATA |
1455 |
PCRE_EXTRA_MATCH_LIMIT |
PCRE_EXTRA_MATCH_LIMIT |
1456 |
|
PCRE_EXTRA_MATCH_LIMIT_RECURSION |
1457 |
PCRE_EXTRA_CALLOUT_DATA |
PCRE_EXTRA_CALLOUT_DATA |
1458 |
PCRE_EXTRA_TABLES |
PCRE_EXTRA_TABLES |
1459 |
|
|
1460 |
Other flag bits should be set to zero. The study_data field is set in |
Other flag bits should be set to zero. The study_data field is set in |
1461 |
the pcre_extra block that is returned by pcre_study(), together with |
the pcre_extra block that is returned by pcre_study(), together with |
1462 |
the appropriate flag bit. You should not set this yourself, but you may |
the appropriate flag bit. You should not set this yourself, but you may |
1463 |
add to the block by setting the other fields and their corresponding |
add to the block by setting the other fields and their corresponding |
1464 |
flag bits. |
flag bits. |
1465 |
|
|
1466 |
The match_limit field provides a means of preventing PCRE from using up |
The match_limit field provides a means of preventing PCRE from using up |
1467 |
a vast amount of resources when running patterns that are not going to |
a vast amount of resources when running patterns that are not going to |
1468 |
match, but which have a very large number of possibilities in their |
match, but which have a very large number of possibilities in their |
1469 |
search trees. The classic example is the use of nested unlimited |
search trees. The classic example is the use of nested unlimited |
1470 |
repeats. |
repeats. |
1471 |
|
|
1472 |
Internally, PCRE uses a function called match() which it calls repeat- |
Internally, PCRE uses a function called match() which it calls repeat- |
1473 |
edly (sometimes recursively). The limit is imposed on the number of |
edly (sometimes recursively). The limit set by match_limit is imposed |
1474 |
times this function is called during a match, which has the effect of |
on the number of times this function is called during a match, which |
1475 |
limiting the amount of recursion and backtracking that can take place. |
has the effect of limiting the amount of backtracking that can take |
1476 |
For patterns that are not anchored, the count starts from zero for each |
place. For patterns that are not anchored, the count restarts from zero |
1477 |
position in the subject string. |
for each position in the subject string. |
1478 |
|
|
1479 |
The default limit for the library can be set when PCRE is built; the |
The default value for the limit can be set when PCRE is built; the |
1480 |
default default is 10 million, which handles all but the most extreme |
default default is 10 million, which handles all but the most extreme |
1481 |
cases. You can reduce the default by supplying pcre_exec() with a |
cases. You can override the default by supplying pcre_exec() with a |
1482 |
pcre_extra block in which match_limit is set to a smaller value, and |
pcre_extra block in which match_limit is set, and |
1483 |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
1484 |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
1485 |
|
|
1486 |
The callout_data field is used in conjunction with the "callout" fea- |
The match_limit_recursion field is similar to match_limit, but instead |
1487 |
|
of limiting the total number of times that match() is called, it limits |
1488 |
|
the depth of recursion. The recursion depth is a smaller number than |
1489 |
|
the total number of calls, because not all calls to match() are recur- |
1490 |
|
sive. This limit is of use only if it is set smaller than match_limit. |
1491 |
|
|
1492 |
|
Limiting the recursion depth limits the amount of stack that can be |
1493 |
|
used, or, when PCRE has been compiled to use memory on the heap instead |
1494 |
|
of the stack, the amount of heap memory that can be used. |
1495 |
|
|
1496 |
|
The default value for match_limit_recursion can be set when PCRE is |
1497 |
|
built; the default default is the same value as the default for |
1498 |
|
match_limit. You can override the default by supplying pcre_exec() with |
1499 |
|
a pcre_extra block in which match_limit_recursion is set, and |
1500 |
|
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
1501 |
|
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
1502 |
|
|
1503 |
|
The callout_data field is used in conjunction with the "callout" fea- |
1504 |
ture, which is described in the pcrecallout documentation. |
ture, which is described in the pcrecallout documentation. |
1505 |
|
|
1506 |
The tables field is used to pass a character tables pointer to |
The tables field is used to pass a character tables pointer to |
1507 |
pcre_exec(); this overrides the value that is stored with the compiled |
pcre_exec(); this overrides the value that is stored with the compiled |
1508 |
pattern. A non-NULL value is stored with the compiled pattern only if |
pattern. A non-NULL value is stored with the compiled pattern only if |
1509 |
custom tables were supplied to pcre_compile() via its tableptr argu- |
custom tables were supplied to pcre_compile() via its tableptr argu- |
1510 |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
1511 |
PCRE's internal tables to be used. This facility is helpful when re- |
PCRE's internal tables to be used. This facility is helpful when re- |
1512 |
using patterns that have been saved after compiling with an external |
using patterns that have been saved after compiling with an external |
1513 |
set of tables, because the external tables might be at a different |
set of tables, because the external tables might be at a different |
1514 |
address when pcre_exec() is called. See the pcreprecompile documenta- |
address when pcre_exec() is called. See the pcreprecompile documenta- |
1515 |
tion for a discussion of saving compiled patterns for later use. |
tion for a discussion of saving compiled patterns for later use. |
1516 |
|
|
1517 |
Option bits for pcre_exec() |
Option bits for pcre_exec() |
1518 |
|
|
1519 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
1520 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
1521 |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
1522 |
|
|
1523 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1524 |
|
|
1525 |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
1526 |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
1527 |
turned out to be anchored by virtue of its contents, it cannot be made |
turned out to be anchored by virtue of its contents, it cannot be made |
1528 |
unanchored at matching time. |
unanchored at matching time. |
1529 |
|
|
1530 |
PCRE_NOTBOL |
PCRE_NOTBOL |
1531 |
|
|
1532 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
1533 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
1534 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
1535 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
1536 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
1537 |
|
|
1538 |
PCRE_NOTEOL |
PCRE_NOTEOL |
1539 |
|
|
1540 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
1541 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
1542 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
1543 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
1544 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
1545 |
not affect \Z or \z. |
not affect \Z or \z. |
1546 |
|
|
1547 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
1548 |
|
|
1549 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
1550 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
1551 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
1552 |
example, if the pattern |
example, if the pattern |
1553 |
|
|
1554 |
a?b? |
a?b? |
1555 |
|
|
1556 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches the |
1557 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
1558 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
1559 |
rences of "a" or "b". |
rences of "a" or "b". |
1560 |
|
|
1561 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
1562 |
cial case of a pattern match of the empty string within its split() |
cial case of a pattern match of the empty string within its split() |
1563 |
function, and when using the /g modifier. It is possible to emulate |
function, and when using the /g modifier. It is possible to emulate |
1564 |
Perl's behaviour after matching a null string by first trying the match |
Perl's behaviour after matching a null string by first trying the match |
1565 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
1566 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
1567 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
1568 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
1569 |
|
|
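The retry procedure described above can be sketched as a loop. This is an illustration only, under stated assumptions: toy_exec() is a hypothetical stand-in that behaves like pcre_exec() with the pattern a*, OPT_NOTEMPTY and OPT_ANCHORED stand in for the real option bits from pcre.h, and the advance step moves by a single byte, which would need adjusting in UTF-8 mode. The pcredemo.c program remains the reference.

```c
/* Hypothetical stand-ins for PCRE_NOTEMPTY and PCRE_ANCHORED; the real
   values come from pcre.h. */
enum { OPT_NOTEMPTY = 1, OPT_ANCHORED = 2 };

/* Toy replacement for pcre_exec() that behaves like the pattern a*.
   Returns 1 and fills ov[0], ov[1] on a match, or -1 for no match. */
static int toy_exec(const char *s, int len, int start, int opts, int *ov)
{
    int pos, end;
    for (pos = start; pos <= len; pos++) {
        end = pos;
        while (end < len && s[end] == 'a')
            end++;
        if (end > pos || !(opts & OPT_NOTEMPTY)) {
            ov[0] = pos;
            ov[1] = end;
            return 1;
        }
        if (opts & OPT_ANCHORED)
            break;              /* anchored: only one position is tried */
    }
    return -1;
}

/* Stores up to max matches as (start, end) pairs in out[] and returns
   the total number of matches found. */
int find_all(const char *s, int len, int *out, int max)
{
    int ov[2], count = 0, offset = 0, opts = 0;

    while (offset <= len) {
        if (toy_exec(s, len, offset, opts, ov) < 0) {
            if (opts == 0)
                break;          /* ordinary search failed: all done */
            offset++;           /* retry failed: advance one byte */
            opts = 0;           /* and go back to an ordinary match */
            continue;
        }
        if (count < max) {
            out[2 * count] = ov[0];
            out[2 * count + 1] = ov[1];
        }
        count++;
        offset = ov[1];
        /* After an empty match, retry at the same offset, allowing only
           a non-empty match anchored at that offset. */
        opts = (ov[0] == ov[1]) ? (OPT_NOTEMPTY | OPT_ANCHORED) : 0;
    }
    return count;
}
```

For the subject "baac" this loop reports four matches: an empty string at offset 0, "aa" at offset 1, and empty strings at offsets 3 and 4.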
1570 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1571 |
|
|
1572 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
1573 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
1574 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
1575 |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
1576 |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
1577 |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
1578 |
returned. |
returned. |
1579 |
|
|
1580 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
1581 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
1582 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
1583 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
1584 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
1585 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
1586 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
1587 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
1588 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
1589 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
1590 |
|
|
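When the checks are skipped, the caller becomes responsible for startoffset. A minimal sketch of such a caller-side check follows; the function name is hypothetical, and it tests only that the offset does not land on a UTF-8 continuation byte (10xxxxxx). It does not validate the subject as a whole.

```c
/* Returns 1 if offset is a plausible character start in a UTF-8 subject
   of the given length in bytes, 0 otherwise. Only continuation bytes
   (bit pattern 10xxxxxx) are rejected. */
int utf8_offset_ok(const char *subject, int length, int offset)
{
    unsigned char c;

    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;               /* the end of the subject is acceptable */
    c = (unsigned char)subject[offset];
    return (c & 0xC0) != 0x80;
}
```

In the four-byte subject "a\xc3\xa9z", offsets 0, 1, and 3 are accepted and offset 2 is rejected, because it falls inside the two-byte character.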
1591 |
PCRE_PARTIAL |
PCRE_PARTIAL |
1592 |
|
|
1593 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
1594 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
1595 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
1596 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
1597 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
1598 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
1599 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
1600 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
1601 |
|
|
1602 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
1603 |
|
|
1604 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
1605 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
1606 |
mode, the byte offset must point to the start of a UTF-8 character. |
mode, the byte offset must point to the start of a UTF-8 character. |
1607 |
Unlike the pattern string, the subject may contain binary zero bytes. |
Unlike the pattern string, the subject may contain binary zero bytes. |
1608 |
When the starting offset is zero, the search for a match starts at the |
When the starting offset is zero, the search for a match starts at the |
1609 |
beginning of the subject, and this is by far the most common case. |
beginning of the subject, and this is by far the most common case. |
1610 |
|
|
1611 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
1612 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
1613 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
1614 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
1615 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
1616 |
|
|
1617 |
\Biss\B |
\Biss\B |
1618 |
|
|
1619 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
1620 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
1621 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
1622 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
1623 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
1624 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
1625 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
1626 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
1627 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
1628 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
1629 |
|
|
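The difference between a starting offset and a shortened subject can be seen without PCRE at all. The following is a hand-written stand-in for \Biss\B, not the library's matcher; the function names are hypothetical. It looks at the byte before the candidate position, which is available only when the whole subject is passed.

```c
#include <ctype.h>

/* A word character in the sense of \w: letter, digit, or underscore.
   Positions outside the subject count as non-word. */
static int is_word(const char *s, int len, int i)
{
    return i >= 0 && i < len &&
           (isalnum((unsigned char)s[i]) || s[i] == '_');
}

/* Returns the offset of the first "iss" at or after start that has a
   word character on both sides, or -1 if there is none. Bytes before
   start remain visible, as they do with startoffset. */
int find_inner_iss(const char *s, int len, int start)
{
    int i;

    for (i = start; i + 3 <= len; i++) {
        if (s[i] == 'i' && s[i + 1] == 's' && s[i + 2] == 's' &&
            is_word(s, len, i - 1) && is_word(s, len, i + 3))
            return i;
    }
    return -1;
}
```

With the subject "Mississippi", a search from offset 0 returns 1 and a search from offset 4 returns 4. Passing only the remainder "issippi" with offset 0 returns -1, because nothing precedes the first "i".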
1630 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
1631 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
1632 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
1633 |
subject. |
subject. |
1634 |
|
|
1635 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
1636 |
|
|
1637 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
1638 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
1639 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
1640 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
1641 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
1642 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
1643 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
1644 |
|
|
1645 |
Captured substrings are returned to the caller via a vector of integer |
Captured substrings are returned to the caller via a vector of integer |
1646 |
offsets whose address is passed in ovector. The number of elements in |
offsets whose address is passed in ovector. The number of elements in |
1647 |
the vector is passed in ovecsize, which must be a non-negative number. |
the vector is passed in ovecsize, which must be a non-negative number. |
1648 |
Note: this argument is NOT the size of ovector in bytes. |
Note: this argument is NOT the size of ovector in bytes. |
1649 |
|
|
1650 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
1651 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
1652 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
1653 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
1654 |
The length passed in ovecsize should always be a multiple of three. If |
The length passed in ovecsize should always be a multiple of three. If |
1655 |
it is not, it is rounded down. |
it is not, it is rounded down. |
1656 |
|
|
1657 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
1658 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
1659 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
1660 |
element of a pair is set to the offset of the first character in a sub- |
element of a pair is set to the offset of the first character in a sub- |
1661 |
string, and the second is set to the offset of the first character |
string, and the second is set to the offset of the first character |
1662 |
after the end of a substring. The first pair, ovector[0] and |
after the end of a substring. The first pair, ovector[0] and |
1663 |
ovector[1], identifies the portion of the subject string matched by the |
ovector[1], identifies the portion of the subject string matched by the |
1664 |
entire pattern. The next pair is used for the first capturing subpat- |
entire pattern. The next pair is used for the first capturing subpat- |
1665 |
tern, and so on. The value returned by pcre_exec() is the number of |
tern, and so on. The value returned by pcre_exec() is the number of |
1666 |
pairs that have been set. If there are no capturing subpatterns, the |
pairs that have been set. If there are no capturing subpatterns, the |
1667 |
return value from a successful match is 1, indicating that just the |
return value from a successful match is 1, indicating that just the |
1668 |
first pair of offsets has been set. |
first pair of offsets has been set. |
1669 |
|
|
1670 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
1671 |
substrings as separate strings. These are described in the following |
substrings as separate strings. These are described in the following |
1672 |
section. |
section. |
1673 |
|
|
1674 |
It is possible for capturing subpattern number n+1 to match some |
It is possible for capturing subpattern number n+1 to match some |
1675 |
part of the subject when subpattern n has not been used at all. For |
part of the subject when subpattern n has not been used at all. For |
1676 |
example, if the string "abc" is matched against the pattern (a|(z))(bc) |
example, if the string "abc" is matched against the pattern (a|(z))(bc) |
1677 |
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both |
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both |
1678 |
offset values corresponding to the unused subpattern are set to -1. |
offset values corresponding to the unused subpattern are set to -1. |
1679 |
|
|
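The pair layout can be read directly from ovector. The sketch below uses a hypothetical helper, copy_capture(), that copies pair n out of the subject; it is not one of the library's convenience functions, and its -1 error return is an assumption made for this example.

```c
#include <string.h>

/* Copies captured substring n out of subject using the offset pairs in
   ovector. count is the positive value returned by pcre_exec(). Returns
   the length copied, or -1 if there is no pair n or the buffer is too
   small. An unset substring (offsets of -1) gives an empty string. */
int copy_capture(const char *subject, const int *ovector, int count,
                 int n, char *buffer, int size)
{
    int start, len;

    if (n < 0 || n >= count)
        return -1;
    start = ovector[2 * n];
    len = (start < 0) ? 0 : ovector[2 * n + 1] - start;
    if (size < len + 1)
        return -1;
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

For "abc" matched against (a|(z))(bc), the pairs would be {0,3}, {0,1}, {-1,-1}, {1,3} and the return value 4. Pair 0 gives "abc", pair 1 gives "a", pair 3 gives "bc", and pair 2 is unset.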
1680 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
1681 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
1682 |
|
|
1683 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
1684 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
1685 |
function returns a value of zero. In particular, if the substring off- |
function returns a value of zero. In particular, if the substring off- |
1686 |
sets are not of interest, pcre_exec() may be called with ovector passed |
sets are not of interest, pcre_exec() may be called with ovector passed |
1687 |
as NULL and ovecsize as zero. However, if the pattern contains back |
as NULL and ovecsize as zero. However, if the pattern contains back |
1688 |
references and the ovector is not big enough to remember the related |
references and the ovector is not big enough to remember the related |
1689 |
substrings, PCRE has to get additional memory for use during matching. |
substrings, PCRE has to get additional memory for use during matching. |
1690 |
Thus it is usually advisable to supply an ovector. |
Thus it is usually advisable to supply an ovector. |
1691 |
|
|
1692 |
Note that pcre_info() can be used to find out how many capturing sub- |
Note that pcre_info() can be used to find out how many capturing sub- |
1693 |
patterns there are in a compiled pattern. The smallest size for ovector |
patterns there are in a compiled pattern. The smallest size for ovector |
1694 |
that will allow for n captured substrings, in addition to the offsets |
that will allow for n captured substrings, in addition to the offsets |
1695 |
of the substring matched by the whole pattern, is (n+1)*3. |
of the substring matched by the whole pattern, is (n+1)*3. |
1696 |
|
|
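The sizing rule can be written out as arithmetic. Both helper names below are hypothetical; the second applies the rounding down to a multiple of three described earlier.

```c
/* Smallest ovecsize that can return n captured substrings in addition
   to the offsets of the whole match: (n + 1) * 3. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* The ovecsize that is effectively used: the value passed, rounded
   down to a multiple of three. */
int effective_ovecsize(int ovecsize)
{
    return ovecsize - ovecsize % 3;
}
```

A pattern with three capturing subpatterns therefore needs a vector of at least 12 integers, of which the first 8 carry offsets.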
1697 |
Return values from pcre_exec() |
Return values from pcre_exec() |
1698 |
|
|
1699 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
1700 |
defined in the header file: |
defined in the header file: |
1701 |
|
|
1702 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
1705 |
|
|
1706 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
1707 |
|
|
1708 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
1709 |
ovecsize was not zero. |
ovecsize was not zero. |
1710 |
|
|
1711 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
1714 |
|
|
1715 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
1716 |
|
|
1717 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
1718 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
1719 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
1720 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
1721 |
gives when the magic number is not present. |
gives when the magic number is not present. |
1722 |
|
|
1723 |
PCRE_ERROR_UNKNOWN_NODE (-5) |
PCRE_ERROR_UNKNOWN_NODE (-5) |
1724 |
|
|
1725 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
1726 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
1727 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
1728 |
|
|
1729 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
1730 |
|
|
1731 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
1732 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
1733 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
1734 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
1735 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
1736 |
|
|
1737 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
1738 |
|
|
1739 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
1740 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
1741 |
returned by pcre_exec(). |
returned by pcre_exec(). |
1742 |
|
|
1743 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
1744 |
|
|
1745 |
The recursion and backtracking limit, as specified by the match_limit |
The backtracking limit, as specified by the match_limit field in a |
1746 |
field in a pcre_extra structure (or defaulted) was reached. See the |
pcre_extra structure (or defaulted) was reached. See the description |
1747 |
|
above. |
1748 |
|
|
1749 |
|
PCRE_ERROR_RECURSIONLIMIT (-21) |
1750 |
|
|
1751 |
|
The internal recursion limit, as specified by the match_limit_recursion |
1752 |
|
field in a pcre_extra structure (or defaulted) was reached. See the |
1753 |
description above. |
description above. |
1754 |
|
|
1755 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
1756 |
|
|
1757 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
1758 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
1759 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
1760 |
|
|
1761 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
1762 |
|
|
1763 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
1764 |
subject. |
subject. |
1765 |
|
|
1766 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
1767 |
|
|
1768 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
1769 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
1770 |
ter. |
ter. |
1771 |
|
|
1772 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
1773 |
|
|
1774 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
1775 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
1776 |
|
|
1777 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
1778 |
|
|
1779 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
1780 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
1781 |
documentation for details of partial matching. |
documentation for details of partial matching. |
1782 |
|
|
1783 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
1784 |
|
|
1785 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
1786 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
1787 |
|
|
1788 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
1789 |
|
|
1790 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
1791 |
|
|
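A caller usually wants to branch on broad classes of result rather than on every code. The grouping below is a hypothetical one chosen for illustration, not something PCRE defines. The numeric values are those listed above, including -21, which appears only in the newer text. A real program would use the PCRE_ERROR_* names from pcre.h.

```c
/* A hypothetical grouping of pcre_exec() return values. */
enum exec_outcome {
    OUTCOME_MATCH,          /* rc > 0: rc offset pairs were set */
    OUTCOME_MATCH_OVERFLOW, /* rc == 0: matched, ovector too small */
    OUTCOME_NO_MATCH,       /* -1 */
    OUTCOME_PARTIAL,        /* -12 */
    OUTCOME_BAD_ARGUMENT,   /* -2, -3, -10, -11, -13, -15 */
    OUTCOME_LIMIT,          /* -6, -8, -21 */
    OUTCOME_OTHER           /* -4, -5, -14, and anything unlisted */
};

enum exec_outcome classify_exec(int rc)
{
    if (rc > 0)
        return OUTCOME_MATCH;
    switch (rc) {
    case 0:   return OUTCOME_MATCH_OVERFLOW;
    case -1:  return OUTCOME_NO_MATCH;
    case -12: return OUTCOME_PARTIAL;
    case -2: case -3: case -10: case -11: case -13: case -15:
        return OUTCOME_BAD_ARGUMENT;
    case -6: case -8: case -21:
        return OUTCOME_LIMIT;
    default:
        return OUTCOME_OTHER;
    }
}
```

Codes that pcre_exec() never generates itself, such as -7 and -9, fall into OUTCOME_OTHER in this sketch.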
1792 |
|
|
1793 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
1803 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
1804 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
1805 |
|
|
1806 |
Captured substrings can be accessed directly by using the offsets |
Captured substrings can be accessed directly by using the offsets |
1807 |
returned by pcre_exec() in ovector. For convenience, the functions |
returned by pcre_exec() in ovector. For convenience, the functions |
1808 |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
1809 |
string_list() are provided for extracting captured substrings as new, |
string_list() are provided for extracting captured substrings as new, |
1810 |
separate, zero-terminated strings. These functions identify substrings |
separate, zero-terminated strings. These functions identify substrings |
1811 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
1812 |
substrings. A substring that contains a binary zero is correctly |
substrings. A substring that contains a binary zero is correctly |
1813 |
extracted and has a further zero added on the end, but the result is |
extracted and has a further zero added on the end, but the result is |
1814 |
not, of course, a C string. |
not, of course, a C string. |
1815 |
|
|
1816 |
The first three arguments are the same for all three of these func- |
The first three arguments are the same for all three of these func- |
1817 |
tions: subject is the subject string that has just been successfully |
tions: subject is the subject string that has just been successfully |
1818 |
matched, ovector is a pointer to the vector of integer offsets that was |
matched, ovector is a pointer to the vector of integer offsets that was |
1819 |
passed to pcre_exec(), and stringcount is the number of substrings that |
passed to pcre_exec(), and stringcount is the number of substrings that |
1820 |
were captured by the match, including the substring that matched the |
were captured by the match, including the substring that matched the |
1821 |
entire regular expression. This is the value returned by pcre_exec() if |
entire regular expression. This is the value returned by pcre_exec() if |
1822 |
it is greater than zero. If pcre_exec() returned zero, indicating that |
it is greater than zero. If pcre_exec() returned zero, indicating that |
1823 |
it ran out of space in ovector, the value passed as stringcount should |
it ran out of space in ovector, the value passed as stringcount should |
1824 |
be the number of elements in the vector divided by three. |
be the number of elements in the vector divided by three. |
1825 |
|
|
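The rule for choosing stringcount can be captured in a small helper. The function name is hypothetical, and returning 0 for a negative rc is an assumption made here so that the caller has a single value to test.

```c
/* Value to pass as stringcount, given the return value rc of
   pcre_exec() and the ovecsize that was passed to it. Returns 0 when rc
   is negative (no match or an error), meaning there is nothing to
   extract. */
int stringcount_from(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* every pair fitted in ovector */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was filled to capacity */
    return 0;
}
```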
1826 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
1827 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
1828 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
1829 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
1830 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
1831 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
1832 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
1833 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
1834 |
the terminating zero, or one of |
the terminating zero, or one of |
1835 |
|
|
1836 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
1837 |
|
|
1838 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
1839 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
1840 |
|
|
1841 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
1842 |
|
|
1843 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
1844 |
|
|
1845 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
1846 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
1847 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
1848 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
1849 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
1850 |
pointer. The yield of the function is zero if all went well, or |
pointer. The yield of the function is zero if all went well, or |
1851 |
|
|
1852 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
1853 |
|
|
1854 |
if the attempt to get the memory block failed. |
if the attempt to get the memory block failed. |
1855 |
|
|
1856 |
When any of these functions encounter a substring that is unset, which |
When any of these functions encounter a substring that is unset, which |
1857 |
can happen when capturing subpattern number n+1 matches some part of |
can happen when capturing subpattern number n+1 matches some part of |
1858 |
the subject, but subpattern n has not been used at all, they return an |
the subject, but subpattern n has not been used at all, they return an |
1859 |
empty string. This can be distinguished from a genuine zero-length sub- |
empty string. This can be distinguished from a genuine zero-length sub- |
1860 |
string by inspecting the appropriate offset in ovector, which is nega- |
string by inspecting the appropriate offset in ovector, which is nega- |
1861 |
tive for unset substrings. |
tive for unset substrings. |
1862 |
|
|
1863 |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
1864 |
string_list() can be used to free the memory returned by a previous |
string_list() can be used to free the memory returned by a previous |
1865 |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
1866 |
tively. They do nothing more than call the function pointed to by |
tively. They do nothing more than call the function pointed to by |
1867 |
pcre_free, which of course could be called directly from a C program. |
pcre_free, which of course could be called directly from a C program. |
1868 |
However, PCRE is used in some situations where it is linked via a spe- |
However, PCRE is used in some situations where it is linked via a spe- |
1869 |
cial interface to another programming language which cannot use |
cial interface to another programming language which cannot use |
1870 |
pcre_free directly; it is for these cases that the functions are pro- |
pcre_free directly; it is for these cases that the functions are pro- |
1871 |
vided. |
vided. |
1872 |
|
|
1873 |
|
|
1886 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
1887 |
const char **stringptr); |
const char **stringptr); |
1888 |
|
|
1889 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
1890 |
number. For example, for this pattern |
number. For example, for this pattern |
1891 |
|
|
1892 |
(a+)b(?P<xxx>\d+)... |
(a+)b(?P<xxx>\d+)... |
1893 |
|
|
1894 |
the number of the subpattern called "xxx" is 2. You can find the number |
the number of the subpattern called "xxx" is 2. You can find the number |
1895 |
from the name by calling pcre_get_stringnumber(). The first argument is |
from the name by calling pcre_get_stringnumber(). The first argument is |
1896 |
the compiled pattern, and the second is the name. The yield of the |
the compiled pattern, and the second is the name. The yield of the |
1897 |
function is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if |
function is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if |
1898 |
there is no subpattern of that name. |
there is no subpattern of that name. |
1899 |
|
|
1900 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
1901 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
1902 |
are also two functions that do the whole job. |
are also two functions that do the whole job. |
1903 |
|
|
1904 |
Most of the arguments of pcre_copy_named_substring() and |
Most of the arguments of pcre_copy_named_substring() and |
1905 |
pcre_get_named_substring() are the same as those for the similarly |
pcre_get_named_substring() are the same as those for the similarly |
1906 |
named functions that extract by number. As these are described in the |
named functions that extract by number. As these are described in the |
1907 |
previous section, they are not re-described here. There are just two |
previous section, they are not re-described here. There are just two |
1908 |
differences: |
differences: |
1909 |
|
|
1910 |
First, instead of a substring number, a substring name is given. Sec- |
First, instead of a substring number, a substring name is given. Sec- |
1911 |
ond, there is an extra argument, given at the start, which is a pointer |
ond, there is an extra argument, given at the start, which is a pointer |
1912 |
to the compiled pattern. This is needed in order to gain access to the |
to the compiled pattern. This is needed in order to gain access to the |
1913 |
name-to-number translation table. |
name-to-number translation table. |
1914 |
|
|
1915 |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
1916 |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
1917 |
ate. |
ate. |
1918 |
|
|
1919 |
|
|
1920 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
1921 |
|
|
1922 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
1923 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
1924 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
1925 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
1926 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
1927 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
1928 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
1929 |
tation. |
tation. |
1930 |
|
|
1931 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
1932 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
1933 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
1934 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
1935 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
1936 |
|
|
1937 |
|
|
1942 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
1943 |
int *workspace, int wscount); |
int *workspace, int wscount); |
1944 |
|
|
1945 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
1946 |
against a compiled pattern, using a "DFA" matching algorithm. This has |
against a compiled pattern, using a "DFA" matching algorithm. This has |
1947 |
different characteristics to the normal algorithm, and is not compati- |
different characteristics to the normal algorithm, and is not compati- |
1948 |
ble with Perl. Some of the features of PCRE patterns are not supported. |
ble with Perl. Some of the features of PCRE patterns are not supported. |
1949 |
Nevertheless, there are times when this kind of matching can be useful. |
Nevertheless, there are times when this kind of matching can be useful. |
1950 |
For a discussion of the two matching algorithms, see the pcrematching |
For a discussion of the two matching algorithms, see the pcrematching |
1951 |
documentation. |
documentation. |
1952 |
|
|
1953 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
1954 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
1955 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
1956 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
1957 |
repeated here. |
repeated here. |
1958 |
|
|
1959 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
1960 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
1961 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
1962 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
1963 |
lot of possible matches. |
lot of possible matches. |
1964 |
|
|
1965 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
1966 |
|
|
1967 |
int rc; |
int rc; |
1968 |
int ovector[10]; |
int ovector[10]; |
1969 |
int wspace[20]; |
int wspace[20]; |
1970 |
rc = pcre_exec( |
rc = pcre_dfa_exec( |
1971 |
re, /* result of pcre_compile() */ |
re, /* result of pcre_compile() */ |
1972 |
NULL, /* we didn't study the pattern */ |
NULL, /* we didn't study the pattern */ |
1973 |
"some string", /* the subject string */ |
"some string", /* the subject string */ |
1981 |
|
|
1982 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
1983 |
|
|
1984 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
1985 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL, |
1986 |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, |
PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, |
1987 |
PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of |
PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of |
1988 |
these are the same as for pcre_exec(), so their description is not |
these are the same as for pcre_exec(), so their description is not |
1989 |
repeated here. |
repeated here. |
1990 |
|
|
1991 |
PCRE_PARTIAL |
PCRE_PARTIAL |
1992 |
|
|
1993 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
1994 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
1995 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
1996 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
1997 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
1998 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
1999 |
set as the first matching string. |
set as the first matching string. |
2000 |
|
|
2001 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2002 |
|
|
2003 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
2004 |
stop as soon as it has found one match. Because of the way the DFA |
stop as soon as it has found one match. Because of the way the DFA |
2005 |
algorithm works, this is necessarily the shortest possible match at the |
algorithm works, this is necessarily the shortest possible match at the |
2006 |
first possible matching point in the subject string. |
first possible matching point in the subject string. |
2007 |
|
|
2008 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
2009 |
|
|
2010 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
2011 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
2012 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
2013 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
2014 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
2015 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
2016 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
2017 |
documentation. |
documentation. |
2018 |
|
|
2019 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
2020 |
|
|
2021 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2022 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
2023 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
2024 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
2025 |
if the pattern |
if the pattern |
2026 |
|
|
2027 |
<.*> |
<.*> |
2036 |
<something> <something else> |
<something> <something else> |
2037 |
<something> <something else> <something further> |
<something> <something else> <something further> |
2038 |
|
|
2039 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
2040 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
2041 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
2042 |
the offset to the start, and the second is the offset to the end. All |
the offset to the start, and the second is the offset to the end. All |
2043 |
the strings have the same start offset. (Space could have been saved by |
the strings have the same start offset. (Space could have been saved by |
2044 |
giving this only once, but it was decided to retain some compatibility |
giving this only once, but it was decided to retain some compatibility |
2045 |
with the way pcre_exec() returns data, even though the meaning of the |
with the way pcre_exec() returns data, even though the meaning of the |
2046 |
strings is different.) |
strings is different.) |
2047 |
|
|
2048 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
2049 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
2050 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
2051 |
filled with the longest matches. |
filled with the longest matches. |
2052 |
|
|
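The layout of the results in ovector can be illustrated with a small sketch. It does not call the PCRE library; it only decodes a return code and an offsets vector of the shape described above. The helper names dfa_pairs_available() and dfa_copy_match() are hypothetical, and the sketch assumes that every pair of elements in ovector is available for results.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of (start, end) pairs holding valid data after a call.
   rc > 0 is the number of matches; rc == 0 means there were too many
   matches, so the vector holds as many of the longest ones as fit. */
static int dfa_pairs_available(int rc, int ovecsize)
{
    if (rc < 0) return 0;              /* error, or no match */
    if (rc == 0) return ovecsize / 2;  /* vector is full */
    return rc;
}

/* Copy match number idx (0 is the longest) into buf as a C string.
   Returns the match length, or -1 if idx is invalid or buf is too small. */
static int dfa_copy_match(const char *subject, const int *ovector,
                          int pairs, int idx, char *buf, size_t bufsize)
{
    int start, end;
    size_t len;

    if (idx < 0 || idx >= pairs) return -1;
    start = ovector[2 * idx];          /* same for every match */
    end = ovector[2 * idx + 1];
    assert(start >= 0 && end >= start);

    len = (size_t)(end - start);
    if (len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For the subject "x <a> <b> y" and the pattern <.*>, a vector holding {2, 9, 2, 5} with a yield of 2 would decode to "<a> <b>" followed by "<a>".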
2053 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
2054 |
|
|
2055 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
2056 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
2057 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
2058 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
2059 |
|
|
2060 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
2061 |
|
|
2062 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
2063 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
2064 |
reference. |
reference. |
2065 |
|
|
2066 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
2067 |
|
|
2068 |
This return is given if pcre_dfa_exec() encounters a condition item in |
This return is given if pcre_dfa_exec() encounters a condition item in |
2069 |
a pattern that uses a back reference for the condition. This is not |
a pattern that uses a back reference for the condition. This is not |
2070 |
supported. |
supported. |
2071 |
|
|
2072 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
2073 |
|
|
2074 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
2075 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
2076 |
(it is meaningless). |
(it is meaningless). |
2077 |
|
|
2078 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
2079 |
|
|
2080 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
2081 |
workspace vector. |
workspace vector. |
2082 |
|
|
2083 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
2084 |
|
|
2085 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
2086 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
2087 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
2088 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
2089 |
|
|
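When logging failures it can be handy to turn these return codes back into names. The following is a minimal sketch using only the numeric values quoted above. The DEMO_ constants and the function dfa_error_name() are hypothetical and are deliberately not named like the real macros, so that the fragment cannot clash with pcre.h.

```c
#include <assert.h>
#include <stddef.h>

/* Values copied from the text above; not taken from pcre.h. */
enum {
    DEMO_ERROR_DFA_UITEM   = -16,
    DEMO_ERROR_DFA_UCOND   = -17,
    DEMO_ERROR_DFA_UMLIMIT = -18,
    DEMO_ERROR_DFA_WSSIZE  = -19,
    DEMO_ERROR_DFA_RECURSE = -20
};

/* Return the name of a pcre_dfa_exec()-specific error, or NULL if rc is
   some other negative code (for example one shared with pcre_exec()). */
static const char *dfa_error_name(int rc)
{
    assert(rc < 0);  /* only negative yields are errors */
    switch (rc) {
    case DEMO_ERROR_DFA_UITEM:   return "PCRE_ERROR_DFA_UITEM";
    case DEMO_ERROR_DFA_UCOND:   return "PCRE_ERROR_DFA_UCOND";
    case DEMO_ERROR_DFA_UMLIMIT: return "PCRE_ERROR_DFA_UMLIMIT";
    case DEMO_ERROR_DFA_WSSIZE:  return "PCRE_ERROR_DFA_WSSIZE";
    case DEMO_ERROR_DFA_RECURSE: return "PCRE_ERROR_DFA_RECURSE";
    default:                     return NULL;
    }
}
```

A caller that receives PCRE_ERROR_DFA_WSSIZE might reasonably retry with a larger workspace vector; the other codes indicate a pattern or argument that the function cannot handle.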
2090 |
Last updated: 16 May 2005 |
Last updated: 18 January 2006 |
2091 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
2092 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2093 |
|
|
2094 |
|
|
2264 |
handle regular expressions. The differences described here are with |
handle regular expressions. The differences described here are with |
2265 |
respect to Perl 5.8. |
respect to Perl 5.8. |
2266 |
|
|
2267 |
1. PCRE does not have full UTF-8 support. Details of what it does have |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
2268 |
are given in the section on UTF-8 support in the main pcre page. |
of what it does have are given in the section on UTF-8 support in the |
2269 |
|
main pcre page. |
2270 |
|
|
2271 |
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl |
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl |
2272 |
permits them, but they do not mean what you might think. For example, |
permits them, but they do not mean what you might think. For example, |
2273 |
(?!a){3} does not assert that the next three characters are not "a". It |
(?!a){3} does not assert that the next three characters are not "a". It |
2274 |
just asserts that the next character is not "a" three times. |
just asserts that the next character is not "a" three times. |
2275 |
|
|
2276 |
3. Capturing subpatterns that occur inside negative lookahead asser- |
3. Capturing subpatterns that occur inside negative lookahead asser- |
2277 |
tions are counted, but their entries in the offsets vector are never |
tions are counted, but their entries in the offsets vector are never |
2278 |
set. Perl sets its numerical variables from any such patterns that are |
set. Perl sets its numerical variables from any such patterns that are |
2279 |
matched before the assertion fails to match something (thereby succeed- |
matched before the assertion fails to match something (thereby succeed- |
2280 |
ing), but only if the negative lookahead assertion contains just one |
ing), but only if the negative lookahead assertion contains just one |
2281 |
branch. |
branch. |
2282 |
|
|
2283 |
4. Though binary zero characters are supported in the subject string, |
4. Though binary zero characters are supported in the subject string, |
2284 |
they are not allowed in a pattern string because it is passed as a nor- |
they are not allowed in a pattern string because it is passed as a nor- |
2285 |
mal C string, terminated by zero. The escape sequence \0 can be used in |
mal C string, terminated by zero. The escape sequence \0 can be used in |
2286 |
the pattern to represent a binary zero. |
the pattern to represent a binary zero. |
2287 |
|
|
2288 |
5. The following Perl escape sequences are not supported: \l, \u, \L, |
5. The following Perl escape sequences are not supported: \l, \u, \L, |
2289 |
\U, and \N. In fact these are implemented by Perl's general string-han- |
\U, and \N. In fact these are implemented by Perl's general string-han- |
2290 |
dling and are not part of its pattern matching engine. If any of these |
dling and are not part of its pattern matching engine. If any of these |
2291 |
are encountered by PCRE, an error is generated. |
are encountered by PCRE, an error is generated. |
2292 |
|
|
2293 |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
2294 |
is built with Unicode character property support. The properties that |
is built with Unicode character property support. The properties that |
2295 |
can be tested with \p and \P are limited to the general category prop- |
can be tested with \p and \P are limited to the general category prop- |
2296 |
erties such as Lu and Nd. |
erties such as Lu and Nd, script names such as Greek or Han, and the |
2297 |
|
derived properties Any and L&. |
2298 |
|
|
2299 |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
2300 |
ters in between are treated as literals. This is slightly different |
ters in between are treated as literals. This is slightly different |
2367 |
(n) The alternative matching function (pcre_dfa_exec()) matches in a |
(n) The alternative matching function (pcre_dfa_exec()) matches in a |
2368 |
different way and is not Perl-compatible. |
different way and is not Perl-compatible. |
2369 |
|
|
2370 |
Last updated: 28 February 2005 |
Last updated: 24 January 2006 |
2371 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
2372 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2373 |
|
|
2374 |
|
|
2514 |
\t tab (hex 09) |
\t tab (hex 09) |
2515 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or backreference |
2516 |
\xhh character with hex code hh |
\xhh character with hex code hh |
2517 |
\x{hhh..} character with hex code hhh... (UTF-8 mode only) |
\x{hhh..} character with hex code hhh.. |
2518 |
|
|
2519 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
2520 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
2522 |
becomes hex 7B. |
becomes hex 7B. |
2523 |
|
|
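The \cx rule is easy to express in code. This is an illustrative sketch, not PCRE's own source. It assumes an ASCII character set and the "C" locale, and it assumes (as in the full pcrepattern text) that bit 6 is inverted rather than merely cleared. The function name control_escape_value() is hypothetical.

```c
#include <assert.h>
#include <ctype.h>

/* Value produced by \cx for the ASCII character x. */
static unsigned char control_escape_value(unsigned char x)
{
    assert(x < 128);                   /* ASCII only in this sketch */
    if (islower(x))
        x = (unsigned char)toupper(x); /* \cz behaves like \cZ */
    return (unsigned char)(x ^ 0x40);  /* invert bit 6 (hex 40) */
}
```

With this rule, \cz gives hex 1A, \c{ gives hex 3B, and \c; gives hex 7B.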
2524 |
After \x, from zero to two hexadecimal digits are read (letters can be |
After \x, from zero to two hexadecimal digits are read (letters can be |
2525 |
in upper or lower case). In UTF-8 mode, any number of hexadecimal dig- |
in upper or lower case). Any number of hexadecimal digits may appear |
2526 |
its may appear between \x{ and }, but the value of the character code |
between \x{ and }, but the value of the character code must be less |
2527 |
must be less than 2**31 (that is, the maximum hexadecimal value is |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is, |
2528 |
7FFFFFFF). If characters other than hexadecimal digits appear between |
the maximum hexadecimal value is 7FFFFFFF). If characters other than |
2529 |
\x{ and }, or if there is no terminating }, this form of escape is not |
hexadecimal digits appear between \x{ and }, or if there is no termi- |
2530 |
recognized. Instead, the initial \x will be interpreted as a basic |
nating }, this form of escape is not recognized. Instead, the initial |
2531 |
hexadecimal escape, with no following digits, giving a character whose |
\x will be interpreted as a basic hexadecimal escape, with no following |
2532 |
value is zero. |
digits, giving a character whose value is zero. |
2533 |
|
|
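The \x rules in the right-hand (newer) text can be summarized by a small parser. This is a sketch of the behaviour as described, not PCRE's implementation. The function parse_hex_escape() is hypothetical, and returning -1 for an out-of-range braced value is an assumption about how a caller might want the error reported.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Numeric value of one hexadecimal digit (caller guarantees isxdigit). */
static unsigned long hex_value(char c)
{
    if (isdigit((unsigned char)c)) return (unsigned long)(c - '0');
    return (unsigned long)(tolower((unsigned char)c) - 'a' + 10);
}

/* p points just after "\x". On success, stores the character code in
   *value and returns the number of characters consumed after "\x".
   Returns -1 (leaving *value alone) if a braced value is out of range. */
static int parse_hex_escape(const char *p, int utf8, unsigned long *value)
{
    const unsigned long limit = utf8 ? 0x7FFFFFFFUL : 0xFFUL;
    unsigned long v = 0;
    int n = 0;

    assert(p != NULL && value != NULL);

    if (p[0] == '{') {
        const char *q = p + 1;
        int too_big = 0;
        while (isxdigit((unsigned char)*q)) {
            unsigned long d = hex_value(*q);
            if (v > (limit - d) / 16) too_big = 1;  /* would exceed limit */
            else v = v * 16 + d;
            q++;
        }
        if (*q == '}') {
            if (too_big) return -1;
            *value = v;
            return (int)(q - p) + 1;   /* '{', the digits, and '}' */
        }
        v = 0;  /* malformed braces: treat as basic \x with no digits */
    }

    /* Basic form: from zero to two hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)p[n])) {
        v = v * 16 + hex_value(p[n]);
        n++;
    }
    *value = v;
    return n;
}
```

Under these rules "dc" and "{dc}" both yield hex DC, while "{12z}" is not recognized as a braced escape and yields a zero character with nothing consumed.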
2534 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
2535 |
two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference |
two syntaxes for \x. There is no difference in the way they are han- |
2536 |
in the way they are handled. For example, \xdc is exactly the same as |
dled. For example, \xdc is exactly the same as \x{dc}. |
2537 |
\x{dc}. |
|
2538 |
|
After \0 up to two further octal digits are read. In both cases, if |
2539 |
After \0 up to two further octal digits are read. In both cases, if |
there are fewer than two digits, just those that are present are used. |
2540 |
there are fewer than two digits, just those that are present are used. |
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL |
2541 |
Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL |
character (code value 7). Make sure you supply two digits after the |
2542 |
character (code value 7). Make sure you supply two digits after the |
initial zero if the pattern character that follows is itself an octal |
|
initial zero if the pattern character that follows is itself an octal |
|
2543 |
digit. |
digit. |
2544 |
|
|
2545 |
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
2546 |
cated. Outside a character class, PCRE reads it and any following dig- |
cated. Outside a character class, PCRE reads it and any following dig- |
2547 |
its as a decimal number. If the number is less than 10, or if there |
its as a decimal number. If the number is less than 10, or if there |
2548 |
have been at least that many previous capturing left parentheses in the |
have been at least that many previous capturing left parentheses in the |
2549 |
expression, the entire sequence is taken as a back reference. A |
expression, the entire sequence is taken as a back reference. A |
2550 |
description of how this works is given later, following the discussion |
description of how this works is given later, following the discussion |
2551 |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
2552 |
|
|
2553 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
2554 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
2555 |
up to three octal digits following the backslash, and generates a sin- |
up to three octal digits following the backslash, and generates a sin- |
2556 |
gle byte from the least significant 8 bits of the value. Any subsequent |
gle byte from the least significant 8 bits of the value. Any subsequent |
2557 |
digits stand for themselves. For example: |
digits stand for themselves. For example: |
2558 |
|
|
2571 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
2572 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
2573 |
|
|
2574 |
Note that octal values of 100 or greater must not be introduced by a |
Note that octal values of 100 or greater must not be introduced by a |
2575 |
leading zero, because no more than three octal digits are ever read. |
leading zero, because no more than three octal digits are ever read. |
2576 |
|
|
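The decision between a back reference and an octal escape can be sketched as follows. This is an illustration of the rules as stated above, not PCRE's code. The type digit_escape and the function classify_digit_escape() are hypothetical names, and the count of previous capturing parentheses is supplied by the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum escape_kind { ESC_BACKREF, ESC_OCTAL_BYTE };

struct digit_escape {
    enum escape_kind kind;
    int value;     /* group number, or byte value */
    int consumed;  /* digits that belong to the escape itself */
};

/* p points at the first digit after the backslash (a digit other than 0).
   groups_so_far is the number of previous capturing left parentheses. */
static struct digit_escape classify_digit_escape(const char *p,
                                                 int groups_so_far,
                                                 int in_class)
{
    struct digit_escape r;
    int n = 0, len = 0;

    assert(p != NULL && p[0] >= '1' && p[0] <= '9');

    if (!in_class) {
        /* Read all following digits as a decimal number. */
        while (isdigit((unsigned char)p[len])) {
            if (n < 100000)            /* saturate to avoid int overflow */
                n = n * 10 + (p[len] - '0');
            len++;
        }
        if (n < 10 || n <= groups_so_far) {
            r.kind = ESC_BACKREF;
            r.value = n;
            r.consumed = len;
            return r;
        }
    }

    /* Otherwise re-read up to three octal digits and keep the low 8 bits.
       Any digits left over stand for themselves. */
    n = 0;
    len = 0;
    while (len < 3 && p[len] >= '0' && p[len] <= '7') {
        n = n * 8 + (p[len] - '0');
        len++;
    }
    r.kind = ESC_OCTAL_BYTE;
    r.value = n & 0xFF;
    r.consumed = len;
    return r;
}
```

For \81 with fewer than 81 previous groups, no octal digits are consumed, so the result is a binary zero and the characters "8" and "1" remain as literals.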
2577 |
All the sequences that define a single byte value or a single UTF-8 |
All the sequences that define a single byte value or a single UTF-8 |
2578 |
character (in UTF-8 mode) can be used both inside and outside character |
character (in UTF-8 mode) can be used both inside and outside character |
2579 |
classes. In addition, inside a character class, the sequence \b is |
classes. In addition, inside a character class, the sequence \b is |
2580 |
interpreted as the backspace character (hex 08), and the sequence \X is |
interpreted as the backspace character (hex 08), and the sequence \X is |
2581 |
interpreted as the character "X". Outside a character class, these |
interpreted as the character "X". Outside a character class, these |
2582 |
sequences have different meanings (see below). |
sequences have different meanings (see below). |
2583 |
|
|
2584 |
Generic character types |
Generic character types |
2585 |
|
|
2586 |
The third use of backslash is for specifying generic character types. |
The third use of backslash is for specifying generic character types. |
2587 |
The following are always recognized: |
The following are always recognized: |
2588 |
|
|
2589 |
\d any decimal digit |
\d any decimal digit |
2594 |
\W any "non-word" character |
\W any "non-word" character |
2595 |
|
|
2596 |
Each pair of escape sequences partitions the complete set of characters |
Each pair of escape sequences partitions the complete set of characters |
2597 |
into two disjoint sets. Any given character matches one, and only one, |
into two disjoint sets. Any given character matches one, and only one, |
2598 |
of each pair. |
of each pair. |
2599 |
|
|
2600 |
These character type sequences can appear both inside and outside char- |
These character type sequences can appear both inside and outside char- |
2601 |
acter classes. They each match one character of the appropriate type. |
acter classes. They each match one character of the appropriate type. |
2602 |
If the current matching point is at the end of the subject string, all |
If the current matching point is at the end of the subject string, all |
2603 |
of them fail, since there is no character to match. |
of them fail, since there is no character to match. |
2604 |
|
|
2605 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
2606 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
2607 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). |
2608 |
|
|
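The difference from the C library's notion of white space can be shown directly. This sketch assumes ASCII code values and the "C" locale; is_pcre_space() is a hypothetical helper, not a PCRE function.

```c
#include <assert.h>
#include <ctype.h>

/* The \s set as listed above: HT, LF, FF, CR, and space. */
static int is_pcre_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* VT (code 11) is white space for isspace() but is not matched by \s. */
static void pcre_space_selfcheck(void)
{
    assert(isspace(11));
    assert(!is_pcre_space(11));
    assert(isspace(' ') && is_pcre_space(' '));
}
```

The only character on which the two tests disagree in the "C" locale is VT.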
2609 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
2610 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
2611 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
2612 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
2613 |
page). For example, in the "fr_FR" (French) locale, some character |
page). For example, in the "fr_FR" (French) locale, some character |
2614 |
codes greater than 128 are used for accented letters, and these are |
codes greater than 128 are used for accented letters, and these are |
2615 |
matched by \w. |
matched by \w. |
2616 |
|
|
2617 |
In UTF-8 mode, characters with values greater than 128 never match \d, |
In UTF-8 mode, characters with values greater than 128 never match \d, |
2618 |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
2619 |
code character property support is available. |
code character property support is available. The use of locales with |
2620 |
|
Unicode is discouraged. |
2621 |
|
|
2622 |
Unicode character properties |
Unicode character properties |
2623 |
|
|
2624 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
2625 |
tional escape sequences to match generic character types are available |
tional escape sequences to match character properties are available |
2626 |
when UTF-8 mode is selected. They are: |
when UTF-8 mode is selected. They are: |
2627 |
|
|
2628 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
2629 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
2630 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
2631 |
|
|
2632 |
The property names represented by xx above are limited to the Unicode |
The property names represented by xx above are limited to the Unicode |
2633 |
general category properties. Each character has exactly one such prop- |
script names, the general category properties, and "Any", which matches |
2634 |
erty, specified by a two-letter abbreviation. For compatibility with |
any character (including newline). Other properties such as "InMusical- |
2635 |
Perl, negation can be specified by including a circumflex between the |
Symbols" are not currently supported by PCRE. Note that \P{Any} does |
2636 |
opening brace and the property name. For example, \p{^Lu} is the same |
not match any characters, so always causes a match failure. |
2637 |
as \P{Lu}. |
|
2638 |
|
Sets of Unicode characters are defined as belonging to certain scripts. |
2639 |
If only one letter is specified with \p or \P, it includes all the |
A character from one of these sets can be matched using a script name. |
2640 |
properties that start with that letter. In this case, in the absence of |
For example: |
2641 |
negation, the curly brackets in the escape sequence are optional; these |
|
2642 |
two examples have the same effect: |
\p{Greek} |
2643 |
|
\P{Han} |
2644 |
|
|
2645 |
|
Those that are not part of an identified script are lumped together as |
2646 |
|
"Common". The current list of scripts is: |
2647 |
|
|
2648 |
|
Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid, Cana- |
2649 |
|
dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret, |
2650 |
|
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, |
2651 |
|
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, |
2652 |
|
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam, |
2653 |
|
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya, |
2654 |
|
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag- |
2655 |
|
banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
2656 |
|
Ugaritic, Yi. |
2657 |
|
|
2658 |
|
Each character has exactly one general category property, specified by |
2659 |
|
a two-letter abbreviation. For compatibility with Perl, negation can be |
2660 |
|
specified by including a circumflex between the opening brace and the |
2661 |
|
property name. For example, \p{^Lu} is the same as \P{Lu}. |
2662 |
|
|
2663 |
|
If only one letter is specified with \p or \P, it includes all the gen- |
2664 |
|
eral category properties that start with that letter. In this case, in |
2665 |
|
the absence of negation, the curly brackets in the escape sequence are |
2666 |
|
optional; these two examples have the same effect: |
2667 |
|
|
2668 |
\p{L} |
\p{L} |
2669 |
\pL |
\pL |
2670 |
|
|
2671 |
The following property codes are supported: |
The following general category property codes are supported: |
2672 |
|
|
2673 |
C Other |
C Other |
2674 |
Cc Control |
Cc Control |
2714 |
Zp Paragraph separator |
Zp Paragraph separator |
2715 |
Zs Space separator |
Zs Space separator |
2716 |
|
|
2717 |
Extended properties such as "Greek" or "InMusicalSymbols" are not sup- |
The special property L& is also supported: it matches a character that |
2718 |
ported by PCRE. |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
2719 |
|
classified as a modifier or "other". |
2720 |
|
|
2721 |
|
The long synonyms for these properties that Perl supports (such as |
2722 |
|
\p{Letter}) are not supported by PCRE. Nor is it permitted to prefix |
2723 |
|
any of these properties with "Is". |
2724 |
|
|
2725 |
|
No character that is in the Unicode table has the Cn (unassigned) prop- |
2726 |
|
erty. Instead, this property is assumed for any code point that is not |
2727 |
|
in the Unicode table. |
2728 |
|
|
2729 |
Specifying caseless matching does not affect these escape sequences. |
Specifying caseless matching does not affect these escape sequences. |
2730 |
For example, \p{Lu} always matches only upper case letters. |
For example, \p{Lu} always matches only upper case letters. |
3703 |
tion.) The special item (?R) is a recursive call of the entire regular |
tion.) The special item (?R) is a recursive call of the entire regular |
3704 |
expression. |
expression. |
3705 |
|
|
3706 |
For example, this PCRE pattern solves the nested parentheses problem |
A recursive subpattern call is always treated as an atomic group. That |
3707 |
(assume the PCRE_EXTENDED option is set so that white space is |
is, once it has matched some of the subject string, it is never re- |
3708 |
ignored): |
entered, even if it contains untried alternatives and there is a subse- |
3709 |
|
quent matching failure. |
3710 |
|
|
3711 |
|
This PCRE pattern solves the nested parentheses problem (assume the |
3712 |
|
PCRE_EXTENDED option is set so that white space is ignored): |
3713 |
|
|
3714 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
3715 |
|
|
3716 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
3717 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
3718 |
recursive match of the pattern itself (that is a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
3719 |
sized substring). Finally there is a closing parenthesis. |
sized substring). Finally there is a closing parenthesis. |
3720 |
|
|
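As a rough illustration of what this pattern accepts, the following
hand-written C function follows the same steps: an opening parenthesis,
then any number of runs of non-parentheses or nested groups, then a
closing parenthesis. It is only a sketch and does not call PCRE; the
name match_parens is made up for this example.

```c
#include <stddef.h>

/* Hand-written analogue of  \( ( (?>[^()]+) | (?R) )* \)
 * Returns the length of the balanced group that starts at s,
 * or -1 if s does not start with such a group. */
long match_parens(const char *s)
{
    const char *p = s;

    if (*p != '(')
        return -1;                 /* must open with a parenthesis */
    p++;

    for (;;) {
        if (*p == '(') {
            /* nested group: plays the role of (?R) */
            long n = match_parens(p);
            if (n < 0)
                return -1;
            p += n;
        } else if (*p != ')' && *p != '\0') {
            /* run of non-parentheses: plays the role of (?>[^()]+) */
            while (*p != '\0' && *p != '(' && *p != ')')
                p++;
        } else {
            break;                 /* neither alternative applies */
        }
    }

    if (*p != ')')
        return -1;                 /* must close with a parenthesis */
    return (long)(p - s) + 1;
}
```

Like the pattern, the function never goes back into a nested group once
it has been consumed, which mirrors the atomic treatment of recursion.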
3721 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
3799 |
two strings. Such references must, however, follow the subpattern to |
two strings. Such references must, however, follow the subpattern to |
3800 |
which they refer. |
which they refer. |
3801 |
|
|
3802 |
|
Like recursive subpatterns, a "subroutine" call is always treated as an |
3803 |
|
atomic group. That is, once it has matched some of the subject string, |
3804 |
|
it is never re-entered, even if it contains untried alternatives and |
3805 |
|
there is a subsequent matching failure. |
3806 |
|
|
3807 |
|
|
3808 |
CALLOUTS |
CALLOUTS |
3809 |
|
|
3810 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
3811 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
3812 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
3813 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
3814 |
tion. |
tion. |
3815 |
|
|
3816 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
3817 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
3818 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
3819 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
3820 |
all calling out. |
all calling out. |
3821 |
|
|
3822 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
3823 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
3824 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
3825 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
3826 |
points: |
points: |
3827 |
|
|
3828 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
3829 |
|
|
3830 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
3831 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
3832 |
numbered 255. |
numbered 255. |
3833 |
|
|
3834 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
3835 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
3836 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
3837 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
3838 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
3839 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
3840 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
3841 |
|
|
3842 |
Last updated: 28 February 2005 |
Last updated: 24 January 2006 |
3843 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
3844 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
3845 |
|
|
3846 |
|
|
3930 |
uses the date example quoted above: |
uses the date example quoted above: |
3931 |
|
|
3932 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
3933 |
data> 25jun04P |
data> 25jun04\P |
3934 |
0: 25jun04 |
0: 25jun04 |
3935 |
1: jun |
1: jun |
3936 |
data> 25dec3P |
data> 25dec3\P |
3937 |
Partial match |
Partial match |
3938 |
data> 3juP |
data> 3ju\P |
3939 |
Partial match |
Partial match |
3940 |
data> 3jujP |
data> 3juj\P |
3941 |
No match |
No match |
3942 |
data> jP |
data> j\P |
3943 |
No match |
No match |
3944 |
|
|
3945 |
The first data string is matched completely, so pcretest shows the |
The first data string is matched completely, so pcretest shows the |
4029 |
Because of this phenomenon, it does not usually make sense to end a |
Because of this phenomenon, it does not usually make sense to end a |
4030 |
pattern that is going to be matched in this way with a variable repeat. |
pattern that is going to be matched in this way with a variable repeat. |
4031 |
|
|
4032 |
Last updated: 28 February 2005 |
4. Patterns that contain alternatives at the top level which do not all |
4033 |
Copyright (c) 1997-2005 University of Cambridge. |
start with the same pattern item may not work as expected. For example, |
4034 |
|
consider this pattern: |
4035 |
|
|
4036 |
|
1234|3789 |
4037 |
|
|
4038 |
|
If the first part of the subject is "ABC123", a partial match of the |
4039 |
|
first alternative is found at offset 3. There is no partial match for |
4040 |
|
the second alternative, because such a match does not start at the same |
4041 |
|
point in the subject string. Attempting to continue with the string |
4042 |
|
"789" does not yield a match because only those alternatives that match |
4043 |
|
at one point in the subject are remembered. The problem arises because |
4044 |
|
the start of the second alternative matches within the first alterna- |
4045 |
|
tive. There is no problem with anchored patterns or patterns such as: |
4046 |
|
|
4047 |
|
1234|ABCD |
4048 |
|
|
4049 |
|
where no string can be a partial match for both alternatives. |
4050 |
|
|
4051 |
|
Last updated: 16 January 2006 |
4052 |
|
Copyright (c) 1997-2006 University of Cambridge. |
4053 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4054 |
|
|
4055 |
|
|
4163 |
them for release 5.0. However, from now on, it should be possible to |
them for release 5.0. However, from now on, it should be possible to |
4164 |
make changes in a compatible manner. |
make changes in a compatible manner. |
4165 |
|
|
4166 |
Last updated: 28 February 2005 |
Notwithstanding the above, if you have any saved patterns in UTF-8 mode |
4167 |
Copyright (c) 1997-2005 University of Cambridge. |
that use \p or \P that were compiled with any release up to and includ- |
4168 |
|
ing 6.4, you will have to recompile them for release 6.5 and above. |
4169 |
|
|
4170 |
|
Last updated: 01 February 2006 |
4171 |
|
Copyright (c) 1997-2006 University of Cambridge. |
4172 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4173 |
|
|
4174 |
|
|
4293 |
functions call the native ones, it is also necessary to add -lpcre. |
functions call the native ones, it is also necessary to add -lpcre. |
4294 |
|
|
4295 |
I have implemented only those option bits that can be reasonably mapped |
I have implemented only those option bits that can be reasonably mapped |
4296 |
to PCRE native options. In addition, the options REG_EXTENDED and |
to PCRE native options. In addition, the option REG_EXTENDED is defined |
4297 |
REG_NOSUB are defined with the value zero. They have no effect, but |
with the value zero. This has no effect, but since programs that are |
4298 |
since programs that are written to the POSIX interface often use them, |
written to the POSIX interface often use it, this makes it easier to |
4299 |
this makes it easier to slot in PCRE as a replacement library. Other |
slot in PCRE as a replacement library. Other POSIX options are not even |
4300 |
POSIX options are not even defined. |
defined. |
4301 |
|
|
4302 |
When PCRE is called via these functions, it is only the API that is |
When PCRE is called via these functions, it is only the API that is |
4303 |
POSIX-like in style. The syntax and semantics of the regular expres- |
POSIX-like in style. The syntax and semantics of the regular expres- |
4322 |
form. The pattern is a C string terminated by a binary zero, and is |
form. The pattern is a C string terminated by a binary zero, and is |
4323 |
passed in the argument pattern. The preg argument is a pointer to a |
passed in the argument pattern. The preg argument is a pointer to a |
4324 |
regex_t structure that is used as a base for storing information about |
regex_t structure that is used as a base for storing information about |
4325 |
the compiled expression. |
the compiled regular expression. |
4326 |
|
|
4327 |
The argument cflags is either zero, or contains one or more of the bits |
The argument cflags is either zero, or contains one or more of the bits |
4328 |
defined by the following macros: |
defined by the following macros: |
4329 |
|
|
4330 |
REG_DOTALL |
REG_DOTALL |
4331 |
|
|
4332 |
The PCRE_DOTALL option is set when the expression is passed for compi- |
The PCRE_DOTALL option is set when the regular expression is passed for |
4333 |
lation to the native function. Note that REG_DOTALL is not part of the |
compilation to the native function. Note that REG_DOTALL is not part of |
4334 |
POSIX standard. |
the POSIX standard. |
4335 |
|
|
4336 |
REG_ICASE |
REG_ICASE |
4337 |
|
|
4338 |
The PCRE_CASELESS option is set when the expression is passed for com- |
The PCRE_CASELESS option is set when the regular expression is passed |
4339 |
pilation to the native function. |
for compilation to the native function. |
4340 |
|
|
4341 |
REG_NEWLINE |
REG_NEWLINE |
4342 |
|
|
4343 |
The PCRE_MULTILINE option is set when the expression is passed for com- |
The PCRE_MULTILINE option is set when the regular expression is passed |
4344 |
pilation to the native function. Note that this does not mimic the |
for compilation to the native function. Note that this does not mimic |
4345 |
defined POSIX behaviour for REG_NEWLINE (see the following section). |
the defined POSIX behaviour for REG_NEWLINE (see the following sec- |
4346 |
|
tion). |
4347 |
|
|
4348 |
|
REG_NOSUB |
4349 |
|
|
4350 |
|
The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is |
4351 |
|
passed for compilation to the native function. In addition, when a pat- |
4352 |
|
tern that is compiled with this flag is passed to regexec() for match- |
4353 |
|
ing, the nmatch and pmatch arguments are ignored, and no captured |
4354 |
|
strings are returned. |
4355 |
|
|
4356 |
|
REG_UTF8 |
4357 |
|
|
4358 |
|
The PCRE_UTF8 option is set when the regular expression is passed for |
4359 |
|
compilation to the native function. This causes the pattern itself and |
4360 |
|
all data strings used for matching it to be treated as UTF-8 strings. |
4361 |
|
Note that REG_UTF8 is not part of the POSIX standard. |
4362 |
|
|
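A minimal sketch of passing such flags through the POSIX-style calls.
It is written against the system <regex.h> so that it builds without
PCRE, which is why REG_EXTENDED is passed; the assumption is that the
same calls work with the PCRE wrapper once its own header is included
instead. The helper name matches_caseless is hypothetical.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 on a match, 0 on no match, and -1 if the pattern does not
 * compile. Any non-zero result from regexec() is reported as no match. */
int matches_caseless(const char *pattern, const char *subject)
{
    regex_t re;
    int rc;

    /* REG_ICASE asks for caseless matching; REG_NOSUB says that no
     * substring offsets are wanted. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
        return -1;

    /* With REG_NOSUB the nmatch and pmatch arguments are ignored,
     * so 0 and NULL are passed. */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```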
4363 |
In the absence of these flags, no options are passed to the native |
In the absence of these flags, no options are passed to the native |
4364 |
function. This means that the regex is compiled with PCRE default |
function. This means that the regex is compiled with PCRE default |
4425 |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
4426 |
function. |
function. |
4427 |
|
|
4428 |
The portion of the string that was matched, and also any captured sub- |
If the pattern was compiled with the REG_NOSUB flag, no data about any |
4429 |
strings, are returned via the pmatch argument, which points to an array |
matched strings is returned. The nmatch and pmatch arguments of |
4430 |
of nmatch structures of type regmatch_t, containing the members rm_so |
regexec() are ignored. |
4431 |
and rm_eo. These contain the offset to the first character of each sub- |
|
4432 |
string and the offset to the first character after the end of each sub- |
Otherwise, the portion of the string that was matched, and also any cap- |
4433 |
string, respectively. The 0th element of the vector relates to the |
tured substrings, are returned via the pmatch argument, which points to |
4434 |
entire portion of string that was matched; subsequent elements relate |
an array of nmatch structures of type regmatch_t, containing the mem- |
4435 |
to the capturing subpatterns of the regular expression. Unused entries |
bers rm_so and rm_eo. These contain the offset to the first character |
4436 |
in the array have both structure members set to -1. |
of each substring and the offset to the first character after the end |
4437 |
|
of each substring, respectively. The 0th element of the vector relates |
4438 |
|
to the entire portion of string that was matched; subsequent elements |
4439 |
|
relate to the capturing subpatterns of the regular expression. Unused |
4440 |
|
entries in the array have both structure members set to -1. |
4441 |
|
|
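The layout of the pmatch vector can be seen in a short sketch. As an
assumption for portability it uses the system <regex.h> rather than the
PCRE wrapper, and the helper name first_match is invented for this
example.

```c
#include <regex.h>
#include <stddef.h>

/* Compiles pattern, matches it against subject, and fills pmatch with
 * up to nmatch offset pairs. Returns 0 on a match; otherwise returns
 * the non-zero code from regcomp() or regexec(). */
int first_match(const char *pattern, const char *subject,
                regmatch_t *pmatch, size_t nmatch)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;

    /* pmatch[0] receives the whole match, pmatch[1] onwards the
     * capturing subpatterns; unused entries are set to -1. */
    rc = regexec(&re, subject, nmatch, pmatch, 0);
    regfree(&re);
    return rc;
}
```

For the subject "on 25-jun ok" and the pattern ([0-9]+)-([a-z]+), the
0th element covers "25-jun", and the next two cover "25" and "jun".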
4442 |
A successful match yields a zero return; various error codes are |
A successful match yields a zero return; various error codes are |
4443 |
defined in the header file, of which REG_NOMATCH is the "expected" |
defined in the header file, of which REG_NOMATCH is the "expected" |
4468 |
University Computing Service, |
University Computing Service, |
4469 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
4470 |
|
|
4471 |
Last updated: 28 February 2005 |
Last updated: 16 January 2006 |
4472 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
4473 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4474 |
|
|
4475 |
|
|
4642 |
|
|
4643 |
RE_Options & set_caseless(bool) |
RE_Options & set_caseless(bool) |
4644 |
|
|
4645 |
which sets or unsets the modifier. Moreover, PCRE_CONFIG_MATCH_LIMIT |
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can |
4646 |
can be accessed through the set_match_limit() and match_limit() member |
be accessed through the set_match_limit() and match_limit() member |
4647 |
functions. Setting match_limit to a non-zero value will limit the exe- |
functions. Setting match_limit to a non-zero value will limit the exe- |
4648 |
cution of pcre to keep it from doing bad things like blowing the stack |
cution of pcre to keep it from doing bad things like blowing the stack |
4649 |
or taking an eternity to return a result. A value of 5000 is good |
or taking an eternity to return a result. A value of 5000 is good |
4650 |
enough to stop stack blowup in a 2MB thread stack. Setting match_limit |
enough to stop stack blowup in a 2MB thread stack. Setting match_limit |
4651 |
to zero disables match limiting. |
to zero disables match limiting. Alternatively, you can call |
4652 |
|
match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to |
4653 |
|
limit how much PCRE recurses. match_limit() limits the number of |
4654 |
|
matches PCRE does; match_limit_recursion() limits the depth of internal |
4655 |
|
recursion, and therefore the amount of stack that is used. |
4656 |
|
|
4657 |
Normally, to pass one or more modifiers to a RE class, you declare a |
Normally, to pass one or more modifiers to a RE class, you declare a |
4658 |
RE_Options object, set the appropriate options, and pass this object to |
RE_Options object, set the appropriate options, and pass this object to |