92 |
use these to include support for different releases. |
use these to include support for different releases. |
93 |
|
|
94 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
95 |
are used for compiling and matching regular expressions. |
are used for compiling and matching regular expressions. A sample program that |
96 |
|
demonstrates the simplest way of using them is given in the file |
97 |
|
\fIpcredemo.c\fR. The last section of this man page describes how to run it. |
98 |
|
|
99 |
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
100 |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
131 |
The function \fBpcre_compile()\fR is called to compile a pattern into an |
The function \fBpcre_compile()\fR is called to compile a pattern into an |
132 |
internal form. The pattern is a C string terminated by a binary zero, and |
internal form. The pattern is a C string terminated by a binary zero, and |
133 |
is passed in the argument \fIpattern\fR. A pointer to a single block of memory |
is passed in the argument \fIpattern\fR. A pointer to a single block of memory |
134 |
that is obtained via \fBpcre_malloc\fR is returned. This contains the |
that is obtained via \fBpcre_malloc\fR is returned. This contains the compiled |
135 |
compiled code and related data. The \fBpcre\fR type is defined for this for |
code and related data. The \fBpcre\fR type is defined for the returned block; |
136 |
convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the |
this is a typedef for a structure whose contents are not externally defined. It |
137 |
contents of the block are not externally defined. It is up to the caller to |
is up to the caller to free the memory when it is no longer required. |
138 |
free the memory when it is no longer required. |
|
139 |
.PP |
Although the compiled code of a PCRE regex is relocatable, that is, it does not |
140 |
|
depend on memory location, the complete \fBpcre\fR data block is not |
141 |
|
fully relocatable, because it contains a copy of the \fItableptr\fR argument, |
142 |
|
which is an address (see below). |
143 |
|
|
144 |
The size of a compiled pattern is roughly proportional to the length of the |
The size of a compiled pattern is roughly proportional to the length of the |
145 |
pattern string, except that each character class (other than those containing |
pattern string, except that each character class (other than those containing |
146 |
just a single character, negated or not) requires 33 bytes, and repeat |
just a single character, negated or not) requires 33 bytes, and repeat |
147 |
quantifiers with a minimum greater than one or a bounded maximum cause the |
quantifiers with a minimum greater than one or a bounded maximum cause the |
148 |
relevant portions of the compiled pattern to be replicated. |
relevant portions of the compiled pattern to be replicated. |
149 |
.PP |
|
150 |
The \fIoptions\fR argument contains independent bits that affect the |
The \fIoptions\fR argument contains independent bits that affect the |
151 |
compilation. It should be zero if no options are required. Some of the options, |
compilation. It should be zero if no options are required. Some of the options, |
152 |
in particular, those that are compatible with Perl, can also be set and unset |
in particular, those that are compatible with Perl, can also be set and unset |
155 |
their initial settings at the start of compilation and execution. The |
their initial settings at the start of compilation and execution. The |
156 |
PCRE_ANCHORED option can be set at the time of matching as well as at compile |
PCRE_ANCHORED option can be set at the time of matching as well as at compile |
157 |
time. |
time. |
158 |
.PP |
|
159 |
If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. |
If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. |
160 |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns |
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns |
161 |
NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual |
NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual |
162 |
error message. The offset from the start of the pattern to the character where |
error message. The offset from the start of the pattern to the character where |
163 |
the error was discovered is placed in the variable pointed to by |
the error was discovered is placed in the variable pointed to by |
164 |
\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. |
\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. |
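The value stored through \fIerroffset\fR can be used to point at the offending character when reporting a compile error. The following is a minimal sketch in plain C. The helper name format_error_pointer is hypothetical and is not part of PCRE, and the sketch assumes that the pattern is displayed on one line with one column per byte.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): writes the pattern, a newline, and
   a caret under the byte at erroffset into buf. Returns 0 on success, or -1
   if the offset is out of range or the buffer is too small. */
int format_error_pointer(const char *pattern, int erroffset,
                         char *buf, size_t bufsize)
{
  size_t len, need;

  assert(pattern != NULL && buf != NULL);
  len = strlen(pattern);
  if (erroffset < 0 || (size_t)erroffset > len) return -1;

  /* pattern + newline + leading spaces + caret + terminating zero */
  need = len + 1 + (size_t)erroffset + 1 + 1;
  if (need > bufsize) return -1;

  memcpy(buf, pattern, len);
  buf[len] = '\n';
  memset(buf + len + 1, ' ', (size_t)erroffset);
  buf[len + 1 + erroffset] = '^';
  buf[len + 2 + erroffset] = '\0';
  return 0;
}
```

A caller would typically pass the pattern and the offset reported by a failed \fBpcre_compile()\fR call, then print the buffer after the error message.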
165 |
.PP |
|
166 |
If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of |
If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of |
167 |
character tables which are built when it is compiled, using the default C |
character tables which are built when it is compiled, using the default C |
168 |
locale. Otherwise, \fItableptr\fR must be the result of a call to |
locale. Otherwise, \fItableptr\fR must be the result of a call to |
169 |
\fBpcre_maketables()\fR. See the section on locale support below. |
\fBpcre_maketables()\fR. See the section on locale support below. |
170 |
.PP |
|
171 |
|
This code fragment shows a typical straightforward call to \fBpcre_compile()\fR: |
172 |
|
|
173 |
|
pcre *re; |
174 |
|
const char *error; |
175 |
|
int erroffset; |
176 |
|
re = pcre_compile( |
177 |
|
"^A.*Z", /* the pattern */ |
178 |
|
0, /* default options */ |
179 |
|
&error, /* for error message */ |
180 |
|
&erroffset, /* for error offset */ |
181 |
|
NULL); /* use default character tables */ |
182 |
|
|
183 |
The following option bits are defined in the header file: |
The following option bits are defined in the header file: |
184 |
|
|
185 |
PCRE_ANCHORED |
PCRE_ANCHORED |
266 |
When a pattern is going to be used several times, it is worth spending more |
When a pattern is going to be used several times, it is worth spending more |
267 |
time analyzing it in order to speed up the time taken for matching. The |
time analyzing it in order to speed up the time taken for matching. The |
268 |
function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first |
function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first |
269 |
argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR |
argument, and returns a pointer to a \fBpcre_extra\fR block (another typedef |
270 |
typedef) containing additional information about the pattern; this can be |
for a structure with hidden contents) containing additional information about |
271 |
passed to \fBpcre_exec()\fR. If no additional information is available, NULL |
the pattern; this can be passed to \fBpcre_exec()\fR. If no additional |
272 |
is returned. |
information is available, NULL is returned. |
273 |
|
|
274 |
The second argument contains option bits. At present, no options are defined |
The second argument contains option bits. At present, no options are defined |
275 |
for \fBpcre_study()\fR, and this argument should always be zero. |
for \fBpcre_study()\fR, and this argument should always be zero. |
278 |
studying succeeds (even if no data is returned), the variable it points to is |
studying succeeds (even if no data is returned), the variable it points to is |
279 |
set to NULL. Otherwise it points to a textual error message. |
set to NULL. Otherwise it points to a textual error message. |
280 |
|
|
281 |
|
This is a typical call to \fBpcre_study()\fR: |
282 |
|
|
283 |
|
pcre_extra *pe; |
284 |
|
pe = pcre_study( |
285 |
|
re, /* result of pcre_compile() */ |
286 |
|
0, /* no options exist */ |
287 |
|
&error); /* set to NULL or points to a message */ |
288 |
|
|
289 |
At present, studying a pattern is useful only for non-anchored patterns that do |
At present, studying a pattern is useful only for non-anchored patterns that do |
290 |
not have a single fixed starting character. A bitmap of possible starting |
not have a single fixed starting character. A bitmap of possible starting |
291 |
characters is created. |
characters is created. |
335 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
336 |
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid |
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid |
337 |
|
|
338 |
|
Here is a typical call of \fBpcre_fullinfo()\fR, to obtain the length of the |
339 |
|
compiled pattern: |
340 |
|
|
341 |
|
int rc; |
342 |
|
unsigned long int length; |
343 |
|
rc = pcre_fullinfo( |
344 |
|
re, /* result of pcre_compile() */ |
345 |
|
pe, /* result of pcre_study(), or NULL */ |
346 |
|
PCRE_INFO_SIZE, /* what is required */ |
347 |
|
&length); /* where to put the data */ |
348 |
|
|
349 |
The possible values for the third argument are defined in \fBpcre.h\fR, and are |
The possible values for the third argument are defined in \fBpcre.h\fR, and are |
350 |
as follows: |
as follows: |
351 |
|
|
352 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
353 |
|
|
354 |
Return a copy of the options with which the pattern was compiled. The fourth |
Return a copy of the options with which the pattern was compiled. The fourth |
355 |
argument should point to au \fBunsigned long int\fR variable. These option bits |
argument should point to an \fBunsigned long int\fR variable. These option bits |
356 |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
357 |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
358 |
bit forcibly set if the form of the pattern implies that it can match only at |
bit forcibly set if the form of the pattern implies that it can match only at |
433 |
pattern has been studied, the result of the study should be passed in the |
pattern has been studied, the result of the study should be passed in the |
434 |
\fIextra\fR argument. Otherwise this must be NULL. |
\fIextra\fR argument. Otherwise this must be NULL. |
435 |
|
|
436 |
|
Here is an example of a simple call to \fBpcre_exec()\fR: |
437 |
|
|
438 |
|
int rc; |
439 |
|
int ovector[30]; |
440 |
|
rc = pcre_exec( |
441 |
|
re, /* result of pcre_compile() */ |
442 |
|
NULL, /* we didn't study the pattern */ |
443 |
|
"some string", /* the subject string */ |
444 |
|
11, /* the length of the subject string */ |
445 |
|
0, /* start at offset 0 in the subject */ |
446 |
|
0, /* default options */ |
447 |
|
ovector, /* vector for substring information */ |
448 |
|
30); /* number of elements in the vector */ |
449 |
|
|
450 |
The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose |
The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose |
451 |
unused bits must be zero. However, if a pattern was compiled with |
unused bits must be zero. However, if a pattern was compiled with |
452 |
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it |
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it |
488 |
|
|
489 |
The subject string is passed as a pointer in \fIsubject\fR, a length in |
The subject string is passed as a pointer in \fIsubject\fR, a length in |
490 |
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern |
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern |
491 |
string, it may contain binary zero characters. When the starting offset is |
string, the subject may contain binary zero characters. When the starting |
492 |
zero, the search for a match starts at the beginning of the subject, and this |
offset is zero, the search for a match starts at the beginning of the subject, |
493 |
is by far the most common case. |
and this is by far the most common case. |
494 |
|
|
495 |
A non-zero starting offset is useful when searching for another match in the |
A non-zero starting offset is useful when searching for another match in the |
496 |
same subject by calling \fBpcre_exec()\fR again after a previous success. |
same subject by calling \fBpcre_exec()\fR again after a previous success. |
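The offset handling for repeated matching can be sketched without the library. In the fragment below, the function find_literal is a stand-in for \fBpcre_exec()\fR that only finds a literal string; it and count_matches are hypothetical names used for illustration. Stepping one byte past an empty match is a simplifying assumption of this sketch, not a description of how PCRE itself treats empty matches.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for pcre_exec() so that the sketch needs no library: finds the
   literal needle in subject[startoffset..length) and stores the offsets of
   the match in ovector[0] and ovector[1]. Returns 1 on a match, else -1. */
static int find_literal(const char *needle, const char *subject, int length,
                        int startoffset, int *ovector)
{
  int nlen = (int)strlen(needle);
  int i;

  for (i = startoffset; i + nlen <= length; i++)
    {
    if (memcmp(subject + i, needle, (size_t)nlen) == 0)
      {
      ovector[0] = i;
      ovector[1] = i + nlen;
      return 1;
      }
    }
  return -1;
}

/* Counts successive non-overlapping matches by restarting each search at
   the end of the previous match. The subject is described by a pointer and
   a length, so it may contain binary zero characters. */
int count_matches(const char *needle, const char *subject, int length)
{
  int ovector[2];
  int startoffset = 0;
  int count = 0;

  assert(needle != NULL && subject != NULL && length >= 0);
  while (startoffset <= length &&
         find_literal(needle, subject, length, startoffset, ovector) > 0)
    {
    count++;
    /* Resume after the match; step one byte past an empty match so that
       the loop always makes progress. */
    startoffset = (ovector[1] > ovector[0]) ? ovector[1] : ovector[1] + 1;
    }
  return count;
}
```

With a real compiled pattern, the same loop shape applies: the second element of the output vector from one successful call becomes the starting offset of the next.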
677 |
practice be relevant. |
practice be relevant. |
678 |
The maximum length of a compiled pattern is 65539 (sic) bytes. |
The maximum length of a compiled pattern is 65539 (sic) bytes. |
679 |
All values in repeating quantifiers must be less than 65536. |
All values in repeating quantifiers must be less than 65536. |
680 |
The maximum number of capturing subpatterns is 99. |
The maximum number of capturing subpatterns is 65535. |
681 |
The maximum number of all parenthesized subpatterns, including capturing |
There is no limit to the number of non-capturing subpatterns, but the maximum |
682 |
|
depth of nesting of all kinds of parenthesized subpattern, including capturing |
683 |
subpatterns, assertions, and other types of subpattern, is 200. |
subpatterns, assertions, and other types of subpattern, is 200. |
684 |
|
|
685 |
The maximum length of a subject string is the largest positive number that an |
The maximum length of a subject string is the largest positive number that an |
1001 |
|
|
1002 |
Note that the sequences \\A, \\Z, and \\z can be used to match the start and |
Note that the sequences \\A, \\Z, and \\z can be used to match the start and |
1003 |
end of the subject in both modes, and if all branches of a pattern start with |
end of the subject in both modes, and if all branches of a pattern start with |
1004 |
\\A is it always anchored, whether PCRE_MULTILINE is set or not. |
\\A it is always anchored, whether PCRE_MULTILINE is set or not. |
1005 |
|
|
1006 |
|
|
1007 |
.SH FULL STOP (PERIOD, DOT) |
.SH FULL STOP (PERIOD, DOT) |
1105 |
|
|
1106 |
[12[:^digit:]] |
[12[:^digit:]] |
1107 |
|
|
1108 |
matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX |
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX |
1109 |
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
1110 |
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
1111 |
|
|
1203 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
1204 |
|
|
1205 |
the captured substrings are "red king", "red", and "king", and are numbered 1, |
the captured substrings are "red king", "red", and "king", and are numbered 1, |
1206 |
2, and 3. |
2, and 3, respectively. |
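The numbering follows the order of the opening parentheses, counted from left to right. As a rough illustration, the sketch below counts the capturing subpatterns in a pattern string. The function name count_capturing_groups is hypothetical and not part of PCRE. The sketch assumes that every group beginning with "(?" is non-capturing, it handles only simple character classes (not POSIX names such as [:digit:] inside a class), and it ignores PCRE_EXTENDED comments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): counts capturing subpatterns by
   scanning for opening parentheses that are not escaped, not inside a
   character class, and not followed by a question mark. */
int count_capturing_groups(const char *pattern)
{
  int count = 0;
  int in_class = 0;
  const char *p;

  assert(pattern != NULL);
  for (p = pattern; *p != '\0'; p++)
    {
    if (*p == '\\')
      {
      if (p[1] == '\0') break;   /* trailing backslash: nothing to skip */
      p++;                       /* skip the escaped character */
      }
    else if (in_class)
      {
      if (*p == ']') in_class = 0;
      }
    else if (*p == '[')
      {
      in_class = 1;
      if (p[1] == '^') p++;      /* negated class */
      if (p[1] == ']') p++;      /* a leading ] is a literal member */
      }
    else if (*p == '(' && p[1] != '?')
      {
      count++;                   /* this group gets the next number */
      }
    }
  return count;
}
```

For the pattern shown above, the three opening parentheses are found in the order that gives "red king" number 1, "red" number 2, and "king" number 3.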
1207 |
|
|
1208 |
The fact that plain parentheses fulfil two functions is not always helpful. |
The fact that plain parentheses fulfil two functions is not always helpful. |
1209 |
There are often times when a grouping subpattern is required without a |
There are often times when a grouping subpattern is required without a |
1297 |
|
|
1298 |
/* first comment */ not comment /* second comment */ |
/* first comment */ not comment /* second comment */ |
1299 |
|
|
1300 |
fails, because it matches the entire string due to the greediness of the .* |
fails, because it matches the entire string owing to the greediness of the .* |
1301 |
item. |
item. |
1302 |
|
|
1303 |
However, if a quantifier is followed by a question mark, it ceases to be |
However, if a quantifier is followed by a question mark, it ceases to be |
1616 |
|
|
1617 |
There are two kinds of condition. If the text between the parentheses consists |
There are two kinds of condition. If the text between the parentheses consists |
1618 |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
1619 |
of that number has previously matched. Consider the following pattern, which |
of that number has previously matched. The number must be greater than zero. |
1620 |
contains non-significant white space to make it more readable (assume the |
Consider the following pattern, which contains non-significant white space to |
1621 |
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: |
make it more readable (assume the PCRE_EXTENDED option) and to divide it into |
1622 |
|
three parts for ease of discussion: |
1623 |
|
|
1624 |
( \\( )? [^()]+ (?(1) \\) ) |
( \\( )? [^()]+ (?(1) \\) ) |
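The logic of this pattern can be written out by hand, which may make the role of the condition easier to see. The sketch below is plain C and does not call PCRE. The function name matches_optional_parens is hypothetical, and the sketch assumes that the pattern must match the whole subject, as if it were anchored at both ends, which the pattern above does not itself require.

```c
#include <assert.h>
#include <string.h>

/* Hand-written equivalent of the three parts of the pattern, applied to the
   whole of a zero-terminated subject. Returns 1 on a match, 0 otherwise. */
int matches_optional_parens(const char *s)
{
  int opened = 0;
  size_t n;

  assert(s != NULL);

  /* Part 1: an optional opening parenthesis (capturing subpattern 1). */
  if (*s == '(')
    {
    opened = 1;
    s++;
    }

  /* Part 2: one or more characters that are not parentheses. */
  n = strcspn(s, "()");
  if (n == 0) return 0;
  s += n;

  /* Part 3: the condition. A closing parenthesis is required only if
     subpattern 1 matched. */
  if (opened)
    {
    if (*s != ')') return 0;
    s++;
    }

  return *s == '\0';
}
```

Under that anchoring assumption the function accepts "abc" and "(abc)" and rejects "(abc" and "abc)".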
1625 |
|
|
1844 |
|
|
1845 |
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
1846 |
|
|
1847 |
|
|
1848 |
|
.SH SAMPLE PROGRAM |
1849 |
|
The code below is a simple, complete demonstration program, to get you started |
1850 |
|
with using PCRE. This code is also supplied in the file \fIpcredemo.c\fR in the |
1851 |
|
PCRE distribution. |
1852 |
|
|
1853 |
|
The program compiles the regular expression that is its first argument, and |
1854 |
|
matches it against the subject string in its second argument. No options are |
1855 |
|
set, and default character tables are used. If matching succeeds, the program |
1856 |
|
outputs the portion of the subject that matched, together with the contents of |
1857 |
|
any captured substrings. |
1858 |
|
|
1859 |
|
On a Unix system that has PCRE installed in \fI/usr/local\fR, you can compile |
1860 |
|
the demonstration program using a command like this: |
1861 |
|
|
1862 |
|
gcc -o pcredemo pcredemo.c -I/usr/local/include -L/usr/local/lib -lpcre |
1863 |
|
|
1864 |
|
Then you can run simple tests like this: |
1865 |
|
|
1866 |
|
./pcredemo 'cat|dog' 'the cat sat on the mat' |
1867 |
|
|
1868 |
|
Note that there is a much more comprehensive test program, called |
1869 |
|
\fBpcretest\fR, which supports many more facilities for testing regular |
1870 |
|
expressions. The \fBpcredemo\fR program is provided as a simple coding example. |
1871 |
|
|
1872 |
|
On some operating systems (e.g. Solaris) you may get an error like this when |
1873 |
|
you try to run \fBpcredemo\fR: |
1874 |
|
|
1875 |
|
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory |
1876 |
|
|
1877 |
|
This is caused by the way shared library support works on those systems. You |
1878 |
|
need to add |
1879 |
|
|
1880 |
|
-R/usr/local/lib |
1881 |
|
|
1882 |
|
to the compile command to get round this problem. Here's the code: |
1883 |
|
|
1884 |
|
#include <stdio.h> |
1885 |
|
#include <string.h> |
1886 |
|
#include <pcre.h> |
1887 |
|
|
1888 |
|
#define OVECCOUNT 30 /* should be a multiple of 3 */ |
1889 |
|
|
1890 |
|
int main(int argc, char **argv) |
1891 |
|
{ |
1892 |
|
pcre *re; |
1893 |
|
const char *error; |
1894 |
|
int erroffset; |
1895 |
|
int ovector[OVECCOUNT]; |
1896 |
|
int rc, i; |
1897 |
|
|
1898 |
|
if (argc != 3) |
1899 |
|
{ |
1900 |
|
printf("Two arguments required: a regex and a " |
1901 |
|
"subject string\\n"); |
1902 |
|
return 1; |
1903 |
|
} |
1904 |
|
|
1905 |
|
/* Compile the regular expression in the first argument */ |
1906 |
|
|
1907 |
|
re = pcre_compile( |
1908 |
|
argv[1], /* the pattern */ |
1909 |
|
0, /* default options */ |
1910 |
|
&error, /* for error message */ |
1911 |
|
&erroffset, /* for error offset */ |
1912 |
|
NULL); /* use default character tables */ |
1913 |
|
|
1914 |
|
/* Compilation failed: print the error message and exit */ |
1915 |
|
|
1916 |
|
if (re == NULL) |
1917 |
|
{ |
1918 |
|
printf("PCRE compilation failed at offset %d: %s\\n", |
1919 |
|
erroffset, error); |
1920 |
|
return 1; |
1921 |
|
} |
1922 |
|
|
1923 |
|
/* Compilation succeeded: match the subject in the second |
1924 |
|
argument */ |
1925 |
|
|
1926 |
|
rc = pcre_exec( |
1927 |
|
re, /* the compiled pattern */ |
1928 |
|
NULL, /* we didn't study the pattern */ |
1929 |
|
argv[2], /* the subject string */ |
1930 |
|
(int)strlen(argv[2]), /* the length of the subject */ |
1931 |
|
0, /* start at offset 0 in the subject */ |
1932 |
|
0, /* default options */ |
1933 |
|
ovector, /* vector for substring information */ |
1934 |
|
OVECCOUNT); /* number of elements in the vector */ |
1935 |
|
|
1936 |
|
/* Matching failed: handle error cases */ |
1937 |
|
|
1938 |
|
if (rc < 0) |
1939 |
|
{ |
1940 |
|
switch(rc) |
1941 |
|
{ |
1942 |
|
case PCRE_ERROR_NOMATCH: printf("No match\\n"); break; |
1943 |
|
/* |
1944 |
|
Handle other special cases if you like |
1945 |
|
*/ |
1946 |
|
default: printf("Matching error %d\\n", rc); break; |
1947 |
|
} |
1948 |
|
return 1; |
1949 |
|
} |
1950 |
|
|
1951 |
|
/* Match succeeded */ |
1952 |
|
|
1953 |
|
printf("Match succeeded\\n"); |
1954 |
|
|
1955 |
|
/* The output vector wasn't big enough */ |
1956 |
|
|
1957 |
|
if (rc == 0) |
1958 |
|
{ |
1959 |
|
rc = OVECCOUNT/3; |
1960 |
|
printf("ovector only has room for %d captured " |
1961 |
|
"substrings\\n", rc - 1); |
1962 |
|
} |
1963 |
|
|
1964 |
|
/* Show substrings stored in the output vector */ |
1965 |
|
|
1966 |
|
for (i = 0; i < rc; i++) |
1967 |
|
{ |
1968 |
|
char *substring_start = argv[2] + ovector[2*i]; |
1969 |
|
int substring_length = ovector[2*i+1] - ovector[2*i]; |
1970 |
|
printf("%2d: %.*s\\n", i, substring_length, |
1971 |
|
substring_start); |
1972 |
|
} |
1973 |
|
|
1974 |
|
return 0; |
1975 |
|
} |
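The offset arithmetic in the final loop of the program can be factored into a small helper. The sketch below is plain C and does not call PCRE. The function name copy_capture is hypothetical; it illustrates the arithmetic only and is not the implementation of \fBpcre_copy_substring()\fR. It assumes that a substring which was not set has negative offsets, and it returns -1 for every kind of failure, which is a simplification.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copies captured substring number n into buf as a
   zero-terminated string. stringcount is the number of substrings that were
   set in ovector, that is, a positive return value from the matching call.
   Returns the length of the substring, or -1 if n is out of range, the
   substring was not set, or the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int n, char *buf, int bufsize)
{
  int start, len;

  assert(subject != NULL && ovector != NULL && buf != NULL);
  if (n < 0 || n >= stringcount) return -1;

  start = ovector[2*n];
  len = ovector[2*n+1] - start;
  if (start < 0 || len < 0) return -1;    /* this substring was not set */
  if (len + 1 > bufsize) return -1;       /* no room for the string */

  memcpy(buf, subject + start, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

Because the copy is zero-terminated, the result can be passed to ordinary C string functions, unlike the offset pairs in the vector.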
1976 |
|
|
1977 |
|
|
1978 |
.SH AUTHOR |
.SH AUTHOR |
1979 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
1980 |
.br |
.br |
1986 |
.br |
.br |
1987 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
1988 |
|
|
1989 |
Last updated: 28 August 2000, |
Last updated: 15 August 2001 |
|
.br |
|
|
the 250th anniversary of the death of J.S. Bach. |
|
1990 |
.br |
.br |
1991 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2001 University of Cambridge. |