
DESCRIPTION

The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was
constructed from the notes in the pcrecpp.h file, which should be
consulted for further details.


MATCHING INTERFACE

The "FullMatch" operation checks that supplied text matches a supplied
pattern exactly. If pointer arguments are supplied, it copies matched
sub-strings that match sub-patterns into them.

Example: successful match

Example: creating a temporary RE object:
   pcrecpp::RE("h.*o").FullMatch("hello");

You can pass in a "const char*" or a "string" for "text". The examples
below tend to use a const char*. You can, as in the different examples
above, store the RE object explicitly in a variable or use a temporary
RE object. The examples below use one mode or the other arbitrarily.
Either could correctly be used for any of these examples.

You must supply extra pointer arguments to extract matched subpieces.

Example: fails because string cannot be stored in integer
   !pcrecpp::RE("(.*)").FullMatch("ruby", &i);

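The failure in the last example comes from converting captured text to the pointed-to type. The following is a minimal standard-library sketch of such an all-or-nothing conversion. It is an illustration only: parse_int is a hypothetical helper, not part of pcrecpp, and the exact rules pcrecpp applies may differ.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not pcrecpp API): convert the whole of `piece` to a
// base-10 int. Returns false and leaves *out untouched if the piece is
// empty, starts with whitespace, has trailing non-digits, or is out of range.
bool parse_int(const std::string& piece, int* out) {
    assert(out != nullptr);
    if (piece.empty() || std::isspace(static_cast<unsigned char>(piece[0]))) {
        return false;
    }
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(piece.c_str(), &end, 10);
    if (end != piece.c_str() + piece.size()) return false;  // not all consumed
    if (errno == ERANGE || value < INT_MIN || value > INT_MAX) return false;
    *out = static_cast<int>(value);
    return true;
}
```

With this helper, parse_int("1234", &i) succeeds, while parse_int("ruby", &i) fails in the same spirit as the example above.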
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:

   string        (matched piece is copied to string)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)

The function returns true iff all of the following conditions are
satisfied:

  a. "text" matches "pattern" exactly;

     number of sub-patterns, "i"th captured sub-pattern is
     ignored.

The matching interface supports at most 16 arguments per call. If you
need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.


PARTIAL MATCHES

You can use the "PartialMatch" operation when you want the pattern to
match any substring of the text.

Example: simple search for a string:

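The difference between matching the whole text and matching any substring can be illustrated with the C++ standard library, whose std::regex_match and std::regex_search draw the same distinction. This is an analogy, not pcrecpp code; full_match and partial_match are hypothetical helper names, and std::regex uses its own (ECMAScript) syntax rather than PCRE's.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Analogue of a full match: the pattern must account for the entire text.
bool full_match(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Analogue of a partial match: the pattern may match any substring.
bool partial_match(const std::string& pattern, const std::string& text) {
    return std::regex_search(text, std::regex(pattern));
}
```

For the pattern "ell" and the text "hello", partial_match succeeds and full_match does not.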
UTF-8 AND THE MATCHING INTERFACE

By default, pattern and text are plain text, one byte per character.
The UTF8 flag, passed to the constructor, causes both pattern and
string to be treated as UTF-8 text, still a byte stream but potentially
multiple bytes per character. In practice, the text is likelier to be
UTF-8 than the pattern, but the match returned may depend on the UTF8
flag, so always use it when matching UTF-8 text. For example, "." will
match one byte normally but with UTF8 set may match several bytes,
namely all the bytes of one multi-byte character.

Example:

--enable-utf8 flag.


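To see why the byte-oriented and character-oriented views differ, the following standard-library sketch counts bytes and characters in UTF-8 text. It does not use pcrecpp; utf8_sequence_length and utf8_char_count are hypothetical helper names, and the counting function assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, judged by its
// bit pattern only. Returns 0 if `lead` does not look like a lead byte
// (for example, a continuation byte of the form 10xxxxxx).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count characters in well-formed UTF-8 text by counting every byte that is
// not a continuation byte.
std::size_t utf8_char_count(const std::string& text) {
    std::size_t count = 0;
    for (unsigned char c : text) {
        if ((c & 0xC0) != 0x80) ++count;
    }
    return count;
}
```

The euro sign, encoded as the three bytes E2 82 AC, has size() 3 but counts as one character, which is the unit a "." is meant to consume when UTF-8 handling is on.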
PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE

PCRE defines some modifiers to change the behavior of the regular
expression engine. The C++ wrapper defines an auxiliary class,
RE_Options, as a vehicle to pass such modifiers to a RE class.
Currently, the following modifiers are supported:

   modifier              description                Perl equivalent

   PCRE_CASELESS         case insensitive match     /i
   PCRE_MULTILINE        multiple lines match       /m
   PCRE_DOTALL           dot matches newlines       /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end      N/A
   PCRE_EXTRA            strict escape parsing      N/A
   PCRE_EXTENDED         ignore whitespace          /x
   PCRE_UTF8             handles UTF8 chars         built-in
   PCRE_UNGREEDY         reverses * and *?          N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens  N/A (*)

(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.

For a full account of how each modifier works, please check the PCRE
API reference page.

For each modifier, there are two member functions whose names are made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by

   bool caseless()

which returns true if the modifier is set, and

   RE_Options & set_caseless(bool)

which sets or unsets the modifier. Moreover, PCRE_CONFIG_MATCH_LIMIT
can be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the stack
or taking an eternity to return a result. A value of 5000 is good
enough to stop stack blowup in a 2MB thread stack. Setting match_limit
to zero disables match limiting.

Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:

   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...

RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are off by default. The other
accepts an option_flags parameter, to facilitate transfer of legacy code
from C programs. This lets you do

   RE(pattern,
      RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

However, new code is better off doing

   RE(pattern,
      RE_Options().set_caseless(true).set_multiline(true))
        .PartialMatch(str);

If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: CASELESS(), UTF8(), MULTILINE(),
DOTALL(), and EXTENDED().

If you need to set several options at once, and you don't want to go
through the pains of declaring a RE_Options object and setting several
options, there is a parallel method that gives you this ability on the
fly. You can chain several set_xxxxx() member functions, since
each of them returns a reference to its class object. For example, to
pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
statement, you may write:

   RE(" ^ xyz \\s+ .* blah$",
      RE_Options()
        .set_caseless(true)
        .set_extended(true)
        .set_multiline(true)).PartialMatch(sometext);


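Chaining works because each setter returns a reference to the object it was called on. The sketch below shows the idea with a hypothetical stand-in class. SketchOptions, its flag constants, and all_flags() are invented for illustration and are not the real RE_Options implementation or PCRE's flag values.

```cpp
#include <cassert>

// Made-up flag bits for this sketch; these are not PCRE's values.
enum : int { kCaseless = 1 << 0, kMultiline = 1 << 1, kExtended = 1 << 2 };

// Hypothetical options class with a default constructor (all flags off),
// a constructor taking a C-style bit mask, and chainable setters.
class SketchOptions {
 public:
    SketchOptions() : flags_(0) {}
    explicit SketchOptions(int flags) : flags_(flags) {}

    bool caseless() const { return (flags_ & kCaseless) != 0; }
    bool multiline() const { return (flags_ & kMultiline) != 0; }
    bool extended() const { return (flags_ & kExtended) != 0; }

    // Each setter returns *this by reference, which is what allows chaining.
    SketchOptions& set_caseless(bool on) { return set(kCaseless, on); }
    SketchOptions& set_multiline(bool on) { return set(kMultiline, on); }
    SketchOptions& set_extended(bool on) { return set(kExtended, on); }

    int all_flags() const { return flags_; }

 private:
    SketchOptions& set(int bit, bool on) {
        if (on) flags_ |= bit; else flags_ &= ~bit;
        return *this;
    }
    int flags_;
};
```

In this sketch, SketchOptions().set_caseless(true).set_multiline(true) yields the same flags as SketchOptions(kCaseless | kMultiline), mirroring the two styles shown above.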
SCANNING TEXT INCREMENTALLY

The "Consume" operation may be useful if you want to repeatedly match
regular expressions at the front of a string and skip over them as they
match. This requires use of the "StringPiece" type, which represents a
sub-range of a real string. Like RE, StringPiece is defined in the
pcrecpp namespace.

Example: read lines of the form "var = value" from a string.

      ...;
   }

Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling

   pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)

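The find-then-advance behaviour can be sketched with the standard library alone. The code below is an analogy, not pcrecpp: Piece is a hypothetical stand-in for a string piece, find_and_consume_word is an invented helper, and the matching is done by std::regex.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical stand-in for a string piece: a view into text owned elsewhere.
struct Piece {
    const char* data;
    std::size_t size;
};

// Find the next word anywhere in `input`, copy it to *word, and advance
// `input` past the end of the match. Returns false, leaving `input`
// unchanged, when no word remains.
bool find_and_consume_word(Piece* input, std::string* word) {
    assert(input != nullptr && word != nullptr);
    static const std::regex kWord("(\\w+)");
    std::cmatch m;
    if (!std::regex_search(input->data, input->data + input->size, m, kWord)) {
        return false;
    }
    *word = m[1].str();
    const std::size_t used =
        static_cast<std::size_t>(m.position(0) + m.length(0));
    input->data += used;
    input->size -= used;
    return true;
}
```

Calling the helper in a loop until it returns false visits each word once, because every successful call shrinks the remaining input.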
PARSING HEX/OCTAL/C-RADIX NUMBERS

By default, if you pass a pointer to a numeric value, the corresponding
text is interpreted as a base-10 number. You can instead wrap the
pointer with a call to one of the operators Hex(), Octal(), or CRadix()
to interpret the text in another base. The CRadix operator interprets
C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
base-10.

Example:

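The base rules described in this section match those of the C library function strtol, where base 0 selects the C-style prefix interpretation. The sketch below uses strtol directly to show the effect; parse_long_radix is a hypothetical helper, and this is not a claim about how pcrecpp implements Hex(), Octal(), or CRadix().

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper: parse `text` as a long in the given base, requiring
// that strtol consumes it to the end. Base 0 applies the C rules: a
// "0x"/"0X" prefix means hex, a leading "0" means octal, otherwise decimal.
bool parse_long_radix(const std::string& text, int base, long* out) {
    assert(out != nullptr);
    if (text.empty()) return false;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text.c_str(), &end, base);
    if (end != text.c_str() + text.size() || errno == ERANGE) return false;
    *out = value;
    return true;
}
```

Under these rules "100" in base 8, "40" in base 16, and both "0100" and "0x40" in base 0 all parse to 64, whereas "100" in base 0 stays decimal.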
REPLACING PARTS OF STRINGS

You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert the text matching the corresponding parenthesized group from the
pattern. \0 in "rewrite" refers to the entire matching text. For example:

   string s = "yabba dabba doo";
   pcrecpp::RE("b+").Replace("d", &s);

will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.

GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:

   string s = "yabba dabba doo";
   pcrecpp::RE("b+").GlobalReplace("d", &s);

will leave "s" containing "yada dada doo". It returns the number of
replacements made.

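For comparison, the same first-match and all-matches behaviour can be reproduced with the standard library's std::regex_replace. This is an analogy rather than pcrecpp code: replace_first and replace_all are hypothetical helper names, and std::regex writes group references in the rewrite string as $1 rather than \1.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Replace only the first match of `pattern` in *s.
// Returns true if a match was found and replaced.
bool replace_first(const std::string& pattern, const std::string& rewrite,
                   std::string* s) {
    assert(s != nullptr);
    const std::regex re(pattern);
    if (!std::regex_search(*s, re)) return false;
    *s = std::regex_replace(*s, re, rewrite,
                            std::regex_constants::format_first_only);
    return true;
}

// Replace every non-overlapping match of `pattern` in *s.
// Returns the number of replacements made.
int replace_all(const std::string& pattern, const std::string& rewrite,
                std::string* s) {
    assert(s != nullptr);
    const std::regex re(pattern);
    const auto count = std::distance(
        std::sregex_iterator(s->cbegin(), s->cend(), re),
        std::sregex_iterator());
    *s = std::regex_replace(*s, re, rewrite);
    return static_cast<int>(count);
}
```

Applied to "yabba dabba doo" with the pattern "b+" and the rewrite "d", the two helpers produce the same strings as the Replace and GlobalReplace examples above.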
Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs,
the string is left unaffected.

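The backslash-digit substitution can be sketched as a small expansion loop over the rewrite string. The code below is a standard-library illustration, not pcrecpp: extract is a hypothetical free function, matching is done by std::regex, and the handling of unknown group numbers and of other backslash escapes is a choice made for this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: if `pattern` matches somewhere in `text`, expand
// `rewrite` (with \0 .. \9 replaced by the matching groups) into *out.
// On no match, or a reference to a group the pattern lacks, return false
// and leave *out untouched.
bool extract(const std::string& pattern, const std::string& rewrite,
             const std::string& text, std::string* out) {
    assert(out != nullptr);
    std::smatch m;
    if (!std::regex_search(text, m, std::regex(pattern))) return false;

    std::string result;
    for (std::size_t i = 0; i < rewrite.size(); ++i) {
        const char c = rewrite[i];
        if (c != '\\' || i + 1 == rewrite.size()) {
            result += c;  // ordinary character, or a lone trailing backslash
            continue;
        }
        const char next = rewrite[++i];
        if (std::isdigit(static_cast<unsigned char>(next))) {
            const std::size_t group = static_cast<std::size_t>(next - '0');
            if (group >= m.size()) return false;  // no such group
            result += m[group].str();
        } else {
            result += next;  // sketch choice: backslash escapes the next char
        }
    }
    *out = result;
    return true;
}
```

For instance, with the pattern "(\\w+)@(\\w+)" and the rewrite "\\2!\\1", the text "mail bob@example now" yields "example!bob"; the surrounding words are ignored.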