
There are a number of optimizations that pcre_exec() uses at the start
of a match, in order to speed up the process. For example, if it is
known that an unanchored match must start with a specific character, it
searches the subject for that character, and fails immediately if it
cannot find it, without actually running the main matching function.
This means that a special item such as (*COMMIT) at the start of a
pattern is not considered until after a suitable starting point for the
match has been found. When callouts are in use, these "start-up"
optimizations can cause them to be skipped if the pattern is never
actually used. The PCRE_NO_START_OPTIMIZE option disables the start-up
optimizations, causing performance to suffer, but ensuring that the
callouts do occur, and that items such as (*COMMIT) are considered at
every possible starting position in the subject string.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset
contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
set, the effect of passing an invalid UTF-8 string as a subject, or a
value of startoffset that does not point to the start of a UTF-8
character, is undefined. Your program may crash.

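Before setting PCRE_NO_UTF8_CHECK, a caller may want its own inexpensive
test that startoffset lands on a character boundary. The following is a
minimal sketch and not part of the PCRE API: the helper name
utf8_offset_is_char_start is hypothetical. It assumes the subject has
already been validated as UTF-8, so that any byte of the form 10xxxxxx is
a continuation byte. Treating the end of the subject as a boundary is
also an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Assumes `subject` is already
   known to be valid UTF-8 and `length` is its size in bytes.
   Returns 1 if `offset` is a usable character boundary, 0 otherwise. */
int utf8_offset_is_char_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;
    if (offset == length)          /* end of subject: no byte to inspect */
        return 1;
    /* Continuation bytes have the bit pattern 10xxxxxx. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

A caller could run this check on each new startoffset and set
PCRE_NO_UTF8_CHECK only when it returns 1.
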
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT

These options turn on the partial matching feature. For backwards
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
match occurs if the end of the subject string is reached successfully,
but there are not enough subject characters to complete the match. If
this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
matching continues by testing any other alternatives. Only if they all
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
The portion of the string that was inspected when the partial match was
found is set as the first matching string. There is a more detailed
discussion in the pcrepartial documentation.

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
In UTF-8 mode, the byte offset must point to the start of a UTF-8
character. Unlike the pattern string, the subject may contain binary zero
bytes. When the starting offset is zero, the search for a match starts
at the beginning of the subject, and this is by far the most common
case.

A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous
success. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern

\Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting point
to discover that it is preceded by a letter.

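The difference between the two calls comes down to what lies before the
starting position. The sketch below hand-codes the word-boundary test
instead of calling PCRE, purely to illustrate that reasoning. The names
is_word_char and is_word_boundary are hypothetical, and the definition
of a word character (ASCII letter, digit, or underscore) is a
simplifying assumption.

```c
#include <ctype.h>

/* Hypothetical helpers, written for illustration; they are not PCRE calls.
   A "word" character here is an ASCII letter, digit, or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position `pos` in subject[0..length) is a word boundary,
   that is, if \b would match there and \B would not. Positions before
   the first byte and after the last byte count as non-word. */
int is_word_boundary(const char *subject, int length, int pos)
{
    int before = (pos > 0) && is_word_char((unsigned char)subject[pos - 1]);
    int after = (pos < length) && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

With the shortened subject, position 0 has nothing before it, so it is a
boundary and \B fails. With the full subject and position 4, a letter
precedes the position, so it is not a boundary and \B succeeds.
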
If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the
subject.

How pcre_exec() returns captured substrings

In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a
substring. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of integers
whose address is passed in ovector. The number of elements in the
vector is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.

The first two-thirds of the vector is used to pass back captured
substrings, each substring using a pair of integers. The remaining third
of the vector is used as workspace by pcre_exec() while matching
capturing subpatterns, and is not available for passing back information.
The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.

When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of each pair is set to the byte offset of the first character
in a substring, and the second is set to the byte offset of the first
character after the end of a substring. Note: these values are always
byte offsets, even in UTF-8 mode. They are not character counts.

The first pair of integers, ovector[0] and ovector[1], identify the
portion of the subject string matched by the entire pattern. The next
pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
has been set. For example, if two substrings have been captured, the
returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.

If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. If the substring offsets are not of
interest, pcre_exec() may be called with ovector passed as NULL and
ovecsize as zero. However, if the pattern contains back references and
the ovector is not big enough to remember the related substrings, PCRE
has to get additional memory for use during matching. Thus it is
usually advisable to supply an ovector.

The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.

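The sizing rules above reduce to two small calculations. The following
sketch states them as code. Both function names are hypothetical and
neither is part of PCRE. The capture count n is assumed to come from
elsewhere, for example from pcre_fullinfo().

```c
/* Hypothetical helpers, not part of PCRE. */

/* Smallest ovecsize that can hold the whole-match pair plus n captures. */
int ovecsize_for_captures(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs pcre_exec() can pass back for a given ovecsize:
   the size is rounded down to a multiple of three and two-thirds of it
   holds pairs, which works out to ovecsize / 3 pairs. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns, ovecsize_for_captures(2)
gives 9, and a vector of 10 elements offers no more pairs than one of 9.
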
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs
corresponding to unused subpatterns are set to -1.

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1. However, you can refer to the offsets
for the second and third capturing subpatterns if you wish (assuming
the vector is large enough, of course).

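Reading a capture directly from ovector means checking for the -1
markers before using the offsets. The sketch below shows one way to do
that. The helper copy_capture and its return conventions (-1 for an
unset capture, -2 for a buffer that is too small) are hypothetical and
chosen only for this illustration. It is not one of PCRE's own
convenience functions.

```c
#include <string.h>

/* Hypothetical helper, not a PCRE function. Copies capture `n` out of
   `subject` using the offset pairs in `ovector`, adding a terminating zero.
   Returns the substring length, -1 if the capture is unset, or -2 if
   `buf` is too small. The caller must ensure that pair `n` lies within
   the part of the vector that pcre_exec() fills in. */
int copy_capture(const char *subject, const int *ovector, int n,
                 char *buf, int bufsize)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    int len;

    if (start < 0 || end < 0)
        return -1;                 /* unused subpattern: offsets are -1 */
    len = end - start;
    if (len + 1 > bufsize)
        return -2;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For "abc" matched against (a|(z))(bc), the pairs described above would
be 0,3 then 0,1 then -1,-1 then 1,3, so capture 2 yields -1 while
captures 1 and 3 yield "a" and "bc".
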
Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.

Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

PCRE_ERROR_NOMATCH (-1)

PCRE_ERROR_NULL (-2)

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

PCRE_ERROR_BADOPTION (-3)

PCRE_ERROR_BADMAGIC (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.

PCRE_ERROR_UNKNOWN_OPCODE (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.

PCRE_ERROR_NOMEMORY (-6)

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.

This error is also given if pcre_stack_malloc() fails in pcre_exec().
This can happen only when PCRE has been compiled with
--disable-stack-for-recursion.

PCRE_ERROR_NOSUBSTRING (-7)

This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().

PCRE_ERROR_MATCHLIMIT (-8)

The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted) was reached. See the description
above.

PCRE_ERROR_CALLOUT (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8
character.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

This code is no longer in use. It was formerly returned when the
PCRE_PARTIAL option was used with a compiled pattern containing items
that were not supported for partial matching. From release 8.00
onwards, there are no restrictions on partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

PCRE_ERROR_RECURSIONLIMIT (-21)

The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted) was reached. See the
description above.

PCRE_ERROR_BADNEWLINE (-23)

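For logging, it can help to turn the value returned by pcre_exec() back
into the names listed above. The following is a minimal sketch: the
function exec_result_name is hypothetical and not part of PCRE, and the
bare numbers are simply the values shown in the list. A real program
would include pcre.h and use its PCRE_ERROR_* macros instead.

```c
/* Hypothetical helper, not part of PCRE. The numeric values are the ones
   listed above; a real program would use the PCRE_ERROR_* macros from
   pcre.h rather than bare numbers. */
const char *exec_result_name(int rc)
{
    if (rc > 0)
        return "match";
    if (rc == 0)
        return "match, but ovector was too small";
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:  return "unlisted error code";
    }
}
```

Codes that do not appear in the list fall through to the default case
rather than being guessed at.
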
int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings by number. The next section describes functions for
extracting named substrings.

A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and
pcre_get_substring(). Unfortunately, the interface to
pcre_get_substring_list() is not adequate for handling strings
containing binary zeros, because the end of the final string is not
independently indicated.

The first three arguments are the same for all three of these
functions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
it is greater than zero. If pcre_exec() returned zero, indicating that
it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.

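The rule for choosing stringcount can be written as a small function.
This is only a sketch: the name stringcount_for is hypothetical, and
returning -1 for a failed match is a convention of this example, not
something PCRE defines.

```c
/* Hypothetical helper, not part of PCRE. Given the value returned by
   pcre_exec() and the ovecsize that was passed to it, returns the
   stringcount to hand to the substring functions, or -1 if the match
   failed (negative return) and there is nothing to extract. */
int stringcount_for(int exec_rc, int ovecsize)
{
    if (exec_rc > 0)
        return exec_rc;            /* normal case: use the return value */
    if (exec_rc == 0)
        return ovecsize / 3;       /* ovector was too small */
    return -1;                     /* error or no match */
}
```

The result, when it is not -1, is the value to pass as the third
argument of the three substring functions.
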
The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of these error codes:

PCRE_ERROR_NOMEMORY (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

PCRE_ERROR_NOSUBSTRING (-7)

There is no substring whose number is stringnumber.

2447 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
2448 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
2449 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
2450 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
2451 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
2452 |
pointer. The yield of the function is zero if all went well, or the |
pointer. The yield of the function is zero if all went well, or the |
2453 |
error code |
error code |
2454 |
|
|
2455 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2456 |
|
|
2457 |
if the attempt to get the memory block failed. |
if the attempt to get the memory block failed. |
2458 |
|
|
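The list layout described above (string pointers terminated by a NULL pointer) can be walked as in the following sketch. It does not call PCRE; count_substrings() is a hypothetical helper name, and the list is assumed to be the value that pcre_get_substring_list() returned via listptr.

```c
#include <assert.h>
#include <stddef.h>

/* Count the entries in a NULL-terminated list of string pointers.
   "list" is assumed to be the block returned via listptr; the helper
   name is hypothetical and is not part of the PCRE API. */
static int count_substrings(const char *const *list)
{
    int n = 0;
    assert(list != NULL);
    while (list[n] != NULL)   /* the NULL pointer marks the end of the list */
        n++;
    return n;
}
```

Because the pointers and the strings live in one block, the whole list is released with a single call to pcre_free_substring_list().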
2459 |
When any of these functions encounters a substring that is unset, which |
When any of these functions encounters a substring that is unset, which |
2460 |
can happen when capturing subpattern number n+1 matches some part of |
can happen when capturing subpattern number n+1 matches some part of |
2461 |
the subject, but subpattern n has not been used at all, they return an |
the subject, but subpattern n has not been used at all, they return an |
2462 |
empty string. This can be distinguished from a genuine zero-length sub- |
empty string. This can be distinguished from a genuine zero-length sub- |
2463 |
string by inspecting the appropriate offset in ovector, which is nega- |
string by inspecting the appropriate offset in ovector, which is nega- |
2464 |
tive for unset substrings. |
tive for unset substrings. |
2465 |
|
|
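As a minimal sketch of the check described above, the two helpers below look only at the offset pairs in ovector. The helper names are hypothetical, and the caller is assumed to pass a substring number n that is smaller than the value returned by pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if capturing subpattern n is unset, 0 otherwise.
   ovector holds (start, end) offset pairs as filled in by pcre_exec();
   an unset substring has negative offsets. Hypothetical helper. */
static int capture_is_unset(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n] < 0;
}

/* Length of a substring that is set; 0 means a genuine empty match. */
static int capture_length(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

Call capture_is_unset() first; capture_length() is only meaningful for a substring that is set.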
2466 |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
2467 |
string_list() can be used to free the memory returned by a previous |
string_list() can be used to free the memory returned by a previous |
2468 |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
2469 |
tively. They do nothing more than call the function pointed to by |
tively. They do nothing more than call the function pointed to by |
2470 |
pcre_free, which of course could be called directly from a C program. |
pcre_free, which of course could be called directly from a C program. |
2471 |
However, PCRE is used in some situations where it is linked via a spe- |
However, PCRE is used in some situations where it is linked via a spe- |
2472 |
cial interface to another programming language that cannot use |
cial interface to another programming language that cannot use |
2473 |
pcre_free directly; it is for these cases that the functions are pro- |
pcre_free directly; it is for these cases that the functions are pro- |
2474 |
vided. |
vided. |
2475 |
|
|
2476 |
|
|
2489 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
2490 |
const char **stringptr); |
const char **stringptr); |
2491 |
|
|
2492 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
2493 |
number. For example, for this pattern |
number. For example, for this pattern |
2494 |
|
|
2495 |
(a+)b(?<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
2498 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
2499 |
name by calling pcre_get_stringnumber(). The first argument is the com- |
name by calling pcre_get_stringnumber(). The first argument is the com- |
2500 |
piled pattern, and the second is the name. The yield of the function is |
piled pattern, and the second is the name. The yield of the function is |
2501 |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
2502 |
subpattern of that name. |
subpattern of that name. |
2503 |
|
|
2504 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
2505 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
2506 |
are also two functions that do the whole job. |
are also two functions that do the whole job. |
2507 |
|
|
2508 |
Most of the arguments of pcre_copy_named_substring() and |
Most of the arguments of pcre_copy_named_substring() and |
2509 |
pcre_get_named_substring() are the same as those for the similarly |
pcre_get_named_substring() are the same as those for the similarly |
2510 |
named functions that extract by number. As these are described in the |
named functions that extract by number. As these are described in the |
2511 |
previous section, they are not re-described here. There are just two |
previous section, they are not re-described here. There are just two |
2512 |
differences: |
differences: |
2513 |
|
|
2514 |
First, instead of a substring number, a substring name is given. Sec- |
First, instead of a substring number, a substring name is given. Sec- |
2515 |
ond, there is an extra argument, given at the start, which is a pointer |
ond, there is an extra argument, given at the start, which is a pointer |
2516 |
to the compiled pattern. This is needed in order to gain access to the |
to the compiled pattern. This is needed in order to gain access to the |
2517 |
name-to-number translation table. |
name-to-number translation table. |
2518 |
|
|
2519 |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
2520 |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
2521 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
2522 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
2523 |
|
|
2524 |
Warning: If the pattern uses the (?| feature to set up multiple subpat- |
Warning: If the pattern uses the (?| feature to set up multiple subpat- |
2525 |
terns with the same number, as described in the section on duplicate |
terns with the same number, as described in the section on duplicate |
2526 |
subpattern numbers in the pcrepattern page, you cannot use names to |
subpattern numbers in the pcrepattern page, you cannot use names to |
2527 |
distinguish the different subpatterns, because names are not included |
distinguish the different subpatterns, because names are not included |
2528 |
in the compiled code. The matching process uses only numbers. For this |
in the compiled code. The matching process uses only numbers. For this |
2529 |
reason, the use of different names for subpatterns of the same number |
reason, the use of different names for subpatterns of the same number |
2530 |
causes an error at compile time. |
causes an error at compile time. |
2531 |
|
|
2532 |
|
|
2535 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
2536 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
2537 |
|
|
2538 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
2539 |
subpatterns are not required to be unique. (Duplicate names are always |
subpatterns are not required to be unique. (Duplicate names are always |
2540 |
allowed for subpatterns with the same number, created by using the (?| |
allowed for subpatterns with the same number, created by using the (?| |
2541 |
feature. Indeed, if such subpatterns are named, they are required to |
feature. Indeed, if such subpatterns are named, they are required to |
2542 |
use the same names.) |
use the same names.) |
2543 |
|
|
2544 |
Normally, patterns with duplicate names are such that in any one match, |
Normally, patterns with duplicate names are such that in any one match, |
2545 |
only one of the named subpatterns participates. An example is shown in |
only one of the named subpatterns participates. An example is shown in |
2546 |
the pcrepattern documentation. |
the pcrepattern documentation. |
2547 |
|
|
2548 |
When duplicates are present, pcre_copy_named_substring() and |
When duplicates are present, pcre_copy_named_substring() and |
2549 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
2550 |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
2551 |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
2552 |
function returns one of the numbers that are associated with the name, |
function returns one of the numbers that are associated with the name, |
2553 |
but it is not defined which it is. |
but it is not defined which it is. |
2554 |
|
|
2555 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
2556 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
2557 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
2558 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
2559 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
2560 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
2561 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
2562 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
2563 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
2564 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
2565 |
the captured data, if any. |
the captured data, if any. |
2566 |
|
|
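The walk over the table entries can be sketched as follows. The sketch assumes the 8-bit entry layout given in the section on information about a pattern: the first two bytes of each entry hold the subpattern number, most significant byte first, followed by the zero-terminated name. The function name and the output array are hypothetical; first, last, and entry_size stand for the values produced by pcre_get_stringtable_entries().

```c
#include <assert.h>
#include <stddef.h>

/* Collect the subpattern numbers of all table entries from "first" to
   "last" inclusive. entry_size is the length of each entry, that is, the
   positive return value of pcre_get_stringtable_entries().
   Returns how many numbers were stored (at most max_numbers). */
static int group_numbers_for_name(const char *first, const char *last,
                                  int entry_size, int *numbers,
                                  int max_numbers)
{
    const char *entry;
    int count = 0;

    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);

    for (entry = first; entry <= last && count < max_numbers;
         entry += entry_size) {
        const unsigned char *p = (const unsigned char *)entry;
        /* two-byte number, most significant byte first */
        numbers[count++] = (p[0] << 8) | p[1];
    }
    return count;
}
```

Each number obtained this way can then be checked against ovector to see which of the duplicates, if any, actually captured data.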
2567 |
|
|
2568 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
2569 |
|
|
2570 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
2571 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
2572 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
2573 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
2574 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
2575 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
2576 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
2577 |
tation. |
tation. |
2578 |
|
|
2579 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
2580 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
2581 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
2582 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
2583 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
2584 |
|
|
2585 |
|
|
2590 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
2591 |
int *workspace, int wscount); |
int *workspace, int wscount); |
2592 |
|
|
2593 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
2594 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
2595 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
2596 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
2597 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
2598 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
2599 |
a discussion of the two matching algorithms, and a list of features |
a discussion of the two matching algorithms, and a list of features |
2600 |
that pcre_dfa_exec() does not support, see the pcrematching documenta- |
that pcre_dfa_exec() does not support, see the pcrematching documenta- |
2601 |
tion. |
tion. |
2602 |
|
|
2603 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
2604 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2605 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
2606 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
2607 |
repeated here. |
repeated here. |
2608 |
|
|
2609 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
2610 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
2611 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
2612 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
2613 |
lot of potential matches. |
lot of potential matches. |
2614 |
|
|
2615 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
2631 |
|
|
2632 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
2633 |
|
|
2634 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2635 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2636 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, |
2637 |
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR- |
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, |
2638 |
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR- |
2639 |
|
TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2640 |
four of these are exactly the same as for pcre_exec(), so their |
four of these are exactly the same as for pcre_exec(), so their |
2641 |
description is not repeated here. |
description is not repeated here. |
2642 |
|
|
2759 |
|
|
2760 |
REVISION |
REVISION |
2761 |
|
|
2762 |
Last updated: 01 June 2010 |
Last updated: 15 June 2010 |
2763 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
2764 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2765 |
|
|