.TH PCREPOSIX 3
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF POSIX API"
.rs
.sp
.B #include <pcreposix.h>
.PP
.SM
.B int regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP,
.ti +5n
.B int \fIcflags\fP);
.PP
.B int regexec(regex_t *\fIpreg\fP, const char *\fIstring\fP,
.ti +5n
.B size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);
.PP
.B size_t regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP,
.ti +5n
.B char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);
.PP
.B void regfree(regex_t *\fIpreg\fP);
.
.SH DESCRIPTION
.rs
.sp
This set of functions provides a POSIX-style API to the PCRE regular expression
package. See the
.\" HREF
\fBpcreapi\fP
.\"
documentation for a description of PCRE's native API, which contains much
additional functionality.
.P
The functions described here are just wrapper functions that ultimately call
the PCRE native API. Their prototypes are defined in the \fBpcreposix.h\fP
header file, and on Unix systems the library itself is called
\fBpcreposix.a\fP, so it can be accessed by adding \fB-lpcreposix\fP to the
command for linking an application that uses them. Because the POSIX functions
call the native ones, it is also necessary to add \fB-lpcre\fP.
.P
I have implemented only those option bits that can be reasonably mapped to PCRE
native options. In addition, the option REG_EXTENDED is defined with the value
zero. This has no effect, but since programs that are written to the POSIX
interface often use it, this makes it easier to slot in PCRE as a replacement
library. Other POSIX options are not even defined.
.P
When PCRE is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE options, as
described below. "POSIX-like in style" means that the API approximates to the
POSIX definition; it is not fully POSIX-compatible, and in multi-byte encoding
domains it is probably even less compatible.
.P
The header for these functions is supplied as \fBpcreposix.h\fP to avoid any
potential clash with other POSIX libraries. It can, of course, be renamed or
aliased as \fBregex.h\fP, which is the "correct" name. It provides two
structure types, \fIregex_t\fP for compiled internal forms, and
\fIregmatch_t\fP for returning captured substrings. It also defines some
constants whose names start with "REG_"; these are used for setting options and
identifying error codes.
.P
.SH "COMPILING A PATTERN"
.rs
.sp
The function \fBregcomp()\fP is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument \fIpattern\fP. The \fIpreg\fP argument is a pointer
to a \fBregex_t\fP structure that is used as a base for storing information
about the compiled regular expression.
.P
The argument \fIcflags\fP is either zero, or contains one or more of the bits
defined by the following macros:
.sp
  REG_DOTALL
.sp
The PCRE_DOTALL option is set when the regular expression is passed for
compilation to the native function. Note that REG_DOTALL is not part of the
POSIX standard.
.sp
  REG_ICASE
.sp
The PCRE_CASELESS option is set when the regular expression is passed for
compilation to the native function.
.sp
  REG_NEWLINE
.sp
The PCRE_MULTILINE option is set when the regular expression is passed for
compilation to the native function. Note that this does \fInot\fP mimic the
defined POSIX behaviour for REG_NEWLINE (see the following section).
.sp
  REG_NOSUB
.sp
The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is passed
for compilation to the native function. In addition, when a pattern that is
compiled with this flag is passed to \fBregexec()\fP for matching, the
\fInmatch\fP and \fIpmatch\fP arguments are ignored, and no captured strings
are returned.
.sp
  REG_UTF8
.sp
The PCRE_UTF8 option is set when the regular expression is passed for
compilation to the native function. This causes the pattern itself and all data
strings used for matching it to be treated as UTF-8 strings. Note that REG_UTF8
is not part of the POSIX standard.
.P
In the absence of these flags, no options are passed to the native function.
This means that the regex is compiled with PCRE default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
\fIsome\fP of the effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by . (they aren't) or by a negative class such as [^a]
(they are).
.P
The yield of \fBregcomp()\fP is zero on success, and non-zero otherwise. The
\fIpreg\fP structure is filled in on success, and one member of the structure
is public: \fIre_nsub\fP contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
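.P
As an illustration of the compile step, the following is a minimal sketch
rather than a definitive usage. It compiles a pattern with REG_ICASE, reads the
public \fIre_nsub\fP member, and releases the compiled form. The function name
\fBcount_groups()\fP and the build switch USE_PCREPOSIX are hypothetical and
are not part of PCRE; without the switch the sketch includes \fBregex.h\fP, the
name under which \fBpcreposix.h\fP may be aliased.
.sp
.nf
```c
#include <stddef.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* the name pcreposix.h may be aliased as */
#endif

/* Compile pattern caselessly and report how many capturing subpatterns it
   has. Returns that count, or -1 if regcomp() reports an error. */
long count_groups(const char *pattern)
{
    regex_t re;
    long n;

    /* REG_EXTENDED is zero in pcreposix.h, so it is harmless there. */
    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;              /* preg is filled in only on success */

    n = (long)re.re_nsub;       /* the one public member of regex_t */
    regfree(&re);               /* release memory tied to the compiled form */
    return n;
}
```
.fi
.P
Under these assumptions, a build against PCRE might look like
"cc -DUSE_PCREPOSIX app.c -lpcreposix -lpcre".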
.
.
.SH "MATCHING NEWLINE CHARACTERS"
.rs
.sp
This area is not simple, because POSIX and Perl take different views of things.
It is not possible to get PCRE to obey POSIX semantics, but then PCRE was never
intended to be a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE:
.sp
.nf
                          Default   Change with
.sp
  . matches newline          no     PCRE_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \en at end       yes    PCRE_DOLLAR_ENDONLY
  $ matches \en in middle    no     PCRE_MULTILINE
  ^ matches \en in middle    no     PCRE_MULTILINE
.fi
.sp
This is the equivalent table for POSIX:
.sp
.nf
                          Default   Change with
.sp
  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \en at end       no     REG_NEWLINE
  $ matches \en in middle    no     REG_NEWLINE
  ^ matches \en in middle    no     REG_NEWLINE
.fi
.sp
PCRE's behaviour is the same as Perl's, except that there is no equivalent for
PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is no way to stop
newline from matching [^a].
.P
The default POSIX newline handling can be obtained by setting PCRE_DOTALL and
PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE behave exactly as for the
REG_NEWLINE action.
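.P
The rows on which the two tables agree can be probed with a small sketch such
as the one below. It tests only ^ and $ at a newline in the middle of the
subject, with and without REG_NEWLINE; the rows that differ (. and [^a]) are
deliberately not exercised. The helper \fBmatches()\fP, the array
\fItwo_lines\fP and the build switch USE_PCREPOSIX are hypothetical names
introduced for this illustration.
.sp
.nf
```c
#include <stddef.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* the name pcreposix.h may be aliased as */
#endif

/* Subject text: "a", a newline (character code 10), then "b". */
const char two_lines[] = { 'a', 10, 'b', 0 };

/* Return 1 if pattern matches subject, 0 if it does not, -1 on any error.
   Extra compile flags such as REG_NEWLINE are passed in cflags. */
int matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return -1;
    rc = regexec(&re, subject, 0, NULL, 0);   /* REG_NOSUB: no captures */
    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
.P
With this helper, "^b" and "a$" are expected not to match \fItwo_lines\fP when
\fIcflags\fP is zero, and to match when REG_NEWLINE is given.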
.
.
.SH "MATCHING A PATTERN"
.rs
.sp
The function \fBregexec()\fP is called to match a compiled pattern \fIpreg\fP
against a given \fIstring\fP, which is by default terminated by a zero byte
(but see REG_STARTEND below), subject to the options in \fIeflags\fP. These can
be:
.sp
  REG_NOTBOL
.sp
The PCRE_NOTBOL option is set when calling the underlying PCRE matching
function.
.sp
  REG_NOTEOL
.sp
The PCRE_NOTEOL option is set when calling the underlying PCRE matching
function.
.sp
  REG_STARTEND
.sp
The string is considered to start at \fIstring\fP + \fIpmatch[0].rm_so\fP and
to have a terminating NUL located at \fIstring\fP + \fIpmatch[0].rm_eo\fP
(there need not actually be a NUL at that location), regardless of the value of
\fInmatch\fP. This is a BSD extension, compatible with but not specified by
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in software
intended to be portable to other systems. Note that a non-zero \fIrm_so\fP does
not imply REG_NOTBOL; REG_STARTEND affects only the location of the string, not
how it is matched.
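.P
A hedged sketch of REG_STARTEND follows. It restricts a search to part of a
buffer by loading \fIpmatch[0]\fP before the call. It reports only whether a
match was found and does not rely on the base of the returned offsets, since
that is not stated above. The name \fBmatches_in_range()\fP and the build
switch USE_PCREPOSIX are hypothetical.
.sp
.nf
```c
#include <stddef.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* the name pcreposix.h may be aliased as */
#endif

/* Match pattern against bytes start..end-1 of buf only.
   Returns 1 on a match, 0 on no match, -1 on any error. */
int matches_in_range(const char *pattern, const char *buf,
                     size_t start, size_t end)
{
    regex_t re;
    regmatch_t pm[1];
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    pm[0].rm_so = (regoff_t)start;   /* subject starts at buf + start */
    pm[0].rm_eo = (regoff_t)end;     /* and is treated as ending at buf + end */
    rc = regexec(&re, buf, 1, pm, REG_STARTEND);

    regfree(&re);
    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```
.fi
.P
For the buffer "xxfooyybar", searching for "foo" in the range 0 to 4 should
fail, because only "xxfo" is visible, while the range 2 to 5 should succeed.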
.P
If the pattern was compiled with the REG_NOSUB flag, no data about any matched
strings is returned. The \fInmatch\fP and \fIpmatch\fP arguments of
\fBregexec()\fP are ignored.
.P
Otherwise, the portion of the string that was matched, and also any captured
substrings, are returned via the \fIpmatch\fP argument, which points to an
array of \fInmatch\fP structures of type \fIregmatch_t\fP, containing the
members \fIrm_so\fP and \fIrm_eo\fP. These contain the offset to the first
character of each substring and the offset to the first character after the end
of each substring, respectively. The 0th element of the vector relates to the
entire portion of \fIstring\fP that was matched; subsequent elements relate to
the capturing subpatterns of the regular expression. Unused entries in the
array have both structure members set to -1.
.P
A successful match yields a zero return; various error codes are defined in the
header file, of which REG_NOMATCH is the "expected" failure code.
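.P
The use of \fIrm_so\fP and \fIrm_eo\fP can be sketched as follows. This is an
illustration under stated assumptions, not a reference implementation: the
function \fBextract_group()\fP, the limit MAX_GROUPS and the build switch
USE_PCREPOSIX are hypothetical names. Entry 0 is the whole match, later entries
are capturing subpatterns, and an entry whose \fIrm_so\fP is -1 is treated as
unused.
.sp
.nf
```c
#include <stddef.h>
#include <string.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* the name pcreposix.h may be aliased as */
#endif

/* Copy substring number group of the first match into out (outsz bytes).
   Returns 0 on success; -1 on no match, error, unused entry, or if the
   substring does not fit in out. */
int extract_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t outsz)
{
    enum { MAX_GROUPS = 10 };   /* arbitrary cap chosen for this sketch */
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int result = -1;

    if (group >= (size_t)MAX_GROUPS)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, subject, MAX_GROUPS, pm, 0) == 0 &&
        pm[group].rm_so != -1) {
        /* rm_eo is the offset of the first character after the substring */
        size_t len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (len < outsz) {
            memcpy(out, subject + pm[group].rm_so, len);
            out[len] = 0;       /* terminate the copy */
            result = 0;
        }
    }
    regfree(&re);
    return result;
}
```
.fi
.P
For example, with the pattern "([a-z]+)=([0-9]+)" and the subject "key=42",
group 2 should yield "42" and group 0 the whole of "key=42".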
.
.
.SH "ERROR MESSAGES"
.rs
.sp
The \fBregerror()\fP function maps a non-zero error code from either
\fBregcomp()\fP or \fBregexec()\fP to a printable message. If \fIpreg\fP is not
NULL, the error should have arisen from the use of that structure. A message
terminated by a binary zero is placed in \fIerrbuf\fP. The length of the
message, including the zero, is limited to \fIerrbuf_size\fP. The yield of the
function is the size of the buffer needed to hold the whole message.
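.P
Because the yield is the size needed for the whole message, a caller can probe
with a tiny buffer and then allocate exactly enough. The sketch below shows
one way to do this. The function name \fBcompile_error()\fP and the build
switch USE_PCREPOSIX are hypothetical, and the one-byte probe buffer is a
choice made for this illustration.
.sp
.nf
```c
#include <stddef.h>
#include <stdlib.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* the name pcreposix.h may be aliased as */
#endif

/* Return a malloc'd message describing why pattern fails to compile, or
   NULL if it compiles (or if memory runs out). The caller frees the result. */
char *compile_error(const char *pattern)
{
    regex_t re;
    char probe[1];
    char *msg;
    size_t need;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&re);           /* compiled fine, nothing to report */
        return NULL;
    }

    /* First call: the message is truncated, but the yield is the full size. */
    need = regerror(rc, &re, probe, sizeof probe);

    msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);   /* second call: whole message fits */
    return msg;
}
```
.fi
.P
The wording of the message depends on the library in use, so callers should
treat it as text for display rather than something to parse.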
.
.
.SH "MEMORY USAGE"
.rs
.sp
Compiling a regular expression causes memory to be allocated and associated
with the \fIpreg\fP structure. The function \fBregfree()\fP frees all such
memory, after which \fIpreg\fP may no longer be used as a compiled expression.
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 05 April 2008
Copyright (c) 1997-2008 University of Cambridge.
.fi