.TH PCRECPP 3
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF C++ WRAPPER"
.rs
.sp
.B #include <pcrecpp.h>
.
.SH DESCRIPTION
.rs
.sp
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
further details.
.
.
.SH "MATCHING INTERFACE"
.rs
.sp
The "FullMatch" operation checks that the supplied text matches the supplied
pattern exactly. If pointer arguments are supplied, it copies the sub-strings
that match sub-patterns into them.
.sp
  Example: successful match:
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");
.sp
  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");
.sp
  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
.sp
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. As the examples above show, you can either store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
.P
You must supply extra pointer arguments to extract matched subpieces.
.sp
  Example: extracts "ruby" into "s" and 1234 into "i":
     int i;
     string s;
     pcrecpp::RE re("(\e\ew+):(\e\ed+)");
     re.FullMatch("ruby:1234", &s, &i);
.sp
  Example: does not try to extract any extra sub-patterns:
     re.FullMatch("ruby:1234", &s);
.sp
  Example: does not try to extract into NULL:
     re.FullMatch("ruby:1234", NULL, &i);
.sp
  Example: integer overflow causes failure:
     !re.FullMatch("ruby:1234567891234", NULL, &i);
.sp
  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
.sp
  Example: fails because string cannot be stored in integer:
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
.sp
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
.sp
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
.sp
The function returns true if and only if all of the following conditions are
satisfied:
.sp
  a. "text" matches "pattern" exactly;
.sp
  b. The number of matched sub-patterns is >= the number of supplied
     pointers;
.sp
  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     NULL for the "i"th argument, or pass fewer arguments than
     the number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
.sp
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
.sp
   int number;
   pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
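.P
The numeric failure cases above (overflow, non-numeric text, and an empty
optional sub-pattern) can be pictured with a small stand-alone sketch. This is
not pcrecpp code: "parse_int" is a hypothetical helper built on std::strtol,
and it assumes that a capture is accepted only when the whole captured string
is a decimal number that fits in an int.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of pcrecpp: store the decimal value of
// "text" in *out and return true only if the whole string is a number
// that fits in an int.
bool parse_int(const std::string& text, int* out) {
  if (text.empty()) return false;  // empty capture: not a valid number
  errno = 0;
  char* end = nullptr;
  long value = std::strtol(text.c_str(), &end, 10);
  if (end != text.c_str() + text.size()) return false;   // unparsed text left
  if (errno == ERANGE) return false;                     // overflows long
  if (value < INT_MIN || value > INT_MAX) return false;  // overflows int
  *out = static_cast<int>(value);
  return true;
}
```

With this helper, "1234" is stored as 1234, while "1234567891234", "ruby",
and the empty string are all rejected.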
.sp
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
\fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature of
\fBDoMatch\fP.
.
.SH "QUOTING METACHARACTERS"
.rs
.sp
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
.sp
  Example:
     string quoted = RE::QuoteMeta(unquoted);
.sp
Note that it is legal to escape a character even if it has no special meaning
in a regular expression, and that is what this function does. (This also makes
it identical to the Perl function of the same name; see "perldoc -f
quotemeta".) For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
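.P
As a rough illustration of the escaping rule, here is a minimal stand-alone
sketch. It is not the library's implementation: "quote_meta_sketch" is a
hypothetical name, and the sketch assumes that letters, digits, underscore,
and bytes with the high bit set (so that multi-byte text passes through
untouched) are the only characters left unescaped.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for RE::QuoteMeta: put a backslash in front of
// every ASCII character other than a letter, digit, or underscore.
// Assumption: bytes >= 0x80 are copied unchanged.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9') || c == '_' || c >= 0x80;
    if (!plain) quoted += '\\';
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```

Applied to the string 1.5-2.0? this produces the same escaped form as the
example in the text.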
.
.SH "PARTIAL MATCHES"
.rs
.sp
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
.sp
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");
.sp
  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\e\ed+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
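.P
The difference between matching the whole text and matching any substring can
also be shown with the standard library, for readers who want to experiment
without PCRE installed. This sketch swaps in std::regex (ECMAScript syntax,
not PCRE); the helper names are hypothetical, and std::regex_match and
std::regex_search are used only as analogues of FullMatch and PartialMatch.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Whole-text match, analogous to FullMatch.
bool full_match(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}

// Match anywhere in the text, analogous to PartialMatch.
bool partial_match(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Return the first run of digits in "text" as an int, or -1 if there is
// none. Assumes the digits fit in an int.
int first_number(const std::string& text) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("([0-9]+)"))) return -1;
  return std::stoi(m[1].str());
}
```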
.
.
.SH "UTF-8 AND THE MATCHING INTERFACE"
.rs
.sp
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match all the bytes of a multi-byte character.
.sp
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8();
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);
.sp
  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
.sp
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
--enable-utf8 flag.
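.P
To see why byte-wise and character-wise matching can disagree, it helps to
look at how UTF-8 marks character boundaries. The following stand-alone sketch
is independent of PCRE; both function names are hypothetical. It classifies
lead bytes by their high bits and counts characters rather than bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by "lead", or 0 if
// "lead" is a continuation byte or not a valid lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Count characters rather than bytes in well-formed UTF-8 text.
std::size_t utf8_char_count(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size(); ++count) {
    std::size_t n = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    i += (n == 0) ? 1 : n;  // step over a stray byte instead of looping
  }
  return count;
}
```

A byte-oriented "." consumes one element of the string, whereas a
character-oriented "." consumes one whole sequence as measured by
utf8_sequence_length().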
.
.
.SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
.rs
.sp
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
.sp
   modifier              description               Perl equivalent
.sp
   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
.sp
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
.P
For a full account of how each modifier works, please check the
PCRE API reference page.
.P
For each modifier, there are two member functions whose names are derived
from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
.sp
  bool caseless()
.sp
which returns true if the modifier is set, and
.sp
  RE_Options & set_caseless(bool)
.sp
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the \fBset_match_limit()\fR and \fBmatch_limit()\fR member
functions. Setting \fImatch_limit\fR to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting \fImatch_limit\fR to zero disables
match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. \fBmatch_limit()\fP limits the number of matches PCRE does;
\fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
therefore the amount of stack that is used.
.P
Normally, to pass one or more modifiers to a RE class, you declare
a \fIRE_Options\fR object, set the appropriate options, and pass this
object to a RE constructor. Example:
.sp
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
.sp
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The other constructor takes an
\fIoption_flags\fR parameter, which is there to facilitate transfer of legacy
code from C programs. This lets you do
.sp
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
.sp
However, new code is better off doing
.sp
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
.sp
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: \fBCASELESS()\fR, \fBUTF8()\fR,
\fBMULTILINE()\fR, \fBDOTALL()\fR, and \fBEXTENDED()\fR.
.P
If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several \fBset_xxxxx()\fR member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
.sp
   RE(" ^ xyz \e\es+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
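.P
Chaining works because every setter returns a reference to the object it was
called on. The following stand-alone sketch shows the idea with a hypothetical
"OptionsSketch" class. The class, its "all_flags" accessor, and the bit values
are invented for illustration; they are not the real RE_Options or the PCRE
constants.

```cpp
#include <cassert>

// Hypothetical bit values chosen for this sketch; not PCRE's constants.
enum SketchFlag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}                       // all flags off
  explicit OptionsSketch(int flags) : flags_(flags) {} // legacy-style bitmask

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Each setter returns *this, so calls can be chained on a temporary.
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  int all_flags() const { return flags_; }

 private:
  OptionsSketch& set(int bit, bool on) {
    if (on) flags_ |= bit; else flags_ &= ~bit;
    return *this;
  }
  int flags_;
};
```

Building the object from a bitmask and building it through chained setters
give the same set of flags, which is the equivalence the two constructor
styles in the text rely on.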
.sp
.
.
.SH "SCANNING TEXT INCREMENTALLY"
.rs
.sp
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
.sp
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
.sp
     string var;
     int value;
     pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
.sp
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
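.P
The consume-and-advance loop can be imitated with the standard library. The
sketch below is an analogue rather than pcrecpp code: it uses std::regex with
the match_continuous flag to anchor each match at the current position, and an
iterator in place of a StringPiece. The function name and the reporting of the
consumed length through an output parameter are choices made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Read "var = value" lines from the front of "text", stopping at the first
// position where the pattern no longer matches. Returns the pairs read and
// stores the number of characters consumed in *consumed.
std::vector<std::pair<std::string, int> > consume_assignments(
    const std::string& text, std::size_t* consumed) {
  static const std::regex line("([A-Za-z0-9_]+) = ([0-9]+)\n");
  std::vector<std::pair<std::string, int> > result;
  std::string::const_iterator pos = text.begin();
  std::smatch m;
  // match_continuous forces the match to start exactly at "pos".
  while (std::regex_search(pos, text.end(), m, line,
                           std::regex_constants::match_continuous)) {
    result.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
    pos = m[0].second;  // advance past the matched text
  }
  *consumed = static_cast<std::size_t>(pos - text.begin());
  return result;
}
```

Dropping the match_continuous flag would let each search skip ahead to the
next match, which is the unanchored behavior.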
.P
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
.sp
  pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
.
.
.SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
.rs
.sp
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
.sp
  Example:
     int a, b, c, d;
     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
     re.FullMatch("100 40 0100 0x40",
                  pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                  pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
.sp
will leave 64 in a, b, c, and d.
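.P
The arithmetic behind this example can be checked without PCRE. The sketch
below uses std::strtol, whose base argument of 0 applies the same C-style
prefix rules described for CRadix. "parse_radix" is a hypothetical helper, and
nothing here is meant to describe how the wrapper itself parses numbers.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Parse all of "text" as an integer in "base". Base 0 asks strtol to
// honour C-style prefixes: "0x" means hexadecimal, a leading "0" means
// octal, and anything else is decimal.
bool parse_radix(const std::string& text, int base, long* out) {
  if (text.empty()) return false;
  char* end = nullptr;
  long value = std::strtol(text.c_str(), &end, base);
  if (end != text.c_str() + text.size()) return false;  // unparsed text left
  *out = value;
  return true;
}
```

Parsing "100" in base 8, "40" in base 16, and "0100" and "0x40" in base 0 all
yield 64.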
.
.
.SH "REPLACING PARTS OF STRINGS"
.rs
.sp
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \e0 in "rewrite" refers to the entire matching
text. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
.sp
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
.P
\fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
.sp
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
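.P
The difference between replacing the first match and replacing every match can
be reproduced with the standard library. This sketch swaps in
std::regex_replace as an analogue; the helper names are hypothetical. Note
that std::regex_replace writes group references in the rewrite string as $1
rather than with a backslash, so only plain rewrite text is used here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Replace only the first match, analogous to Replace.
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Replace every non-overlapping match, analogous to GlobalReplace.
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}
```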
.P
\fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. It returns true if and only
if a match occurred and the extraction happened successfully; if no match
occurs, the string is left unaffected.
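.P
The substitution step shared by these operations, turning a rewrite string
plus the captured groups into output text, can be sketched on its own.
"expand_rewrite" below is a hypothetical helper, not the library's code. It
assumes groups[0] holds the whole match, and, as a choice made for this
sketch only, it copies the character after a backslash literally when that
character is not a digit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Expand "rewrite" into *out, replacing a backslash followed by a digit N
// with groups[N]. Returns false, leaving *out untouched, if the rewrite
// refers to a group that does not exist.
bool expand_rewrite(const std::string& rewrite,
                    const std::vector<std::string>& groups,
                    std::string* out) {
  std::string result;
  for (std::size_t i = 0; i < rewrite.size(); ++i) {
    char c = rewrite[i];
    if (c == '\\' && i + 1 < rewrite.size()) {
      char next = rewrite[++i];
      if (next >= '0' && next <= '9') {
        std::size_t n = static_cast<std::size_t>(next - '0');
        if (n >= groups.size()) return false;  // no such group
        result += groups[n];
      } else {
        result += next;  // sketch-only rule: keep the escaped character
      }
      continue;
    }
    result += c;
  }
  *out = result;
  return true;
}
```

With the groups "ruby:1234", "ruby", and "1234", a rewrite that names group 2,
an equals sign, and group 1 expands to "1234=ruby".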
.
.
.SH AUTHOR
.rs
.sp
.nf
The C++ wrapper was contributed by Google Inc.
Copyright (c) 2006 Google Inc.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 06 March 2007
.fi