44 |
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" |
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" |
45 |
.PP |
.PP |
46 |
.br |
.br |
47 |
|
.B void pcre_free_substring(const char *\fIstringptr\fR); |
48 |
|
.PP |
49 |
|
.br |
50 |
|
.B void pcre_free_substring_list(const char **\fIstringptr\fR); |
51 |
|
.PP |
52 |
|
.br |
53 |
.B const unsigned char *pcre_maketables(void); |
.B const unsigned char *pcre_maketables(void); |
54 |
.PP |
.PP |
55 |
.br |
.br |
76 |
The PCRE library is a set of functions that implement regular expression |
The PCRE library is a set of functions that implement regular expression |
77 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
78 |
differences (see below). The current implementation corresponds to Perl 5.005, |
differences (see below). The current implementation corresponds to Perl 5.005, |
79 |
with some additional features from the Perl development release. |
with some additional features from later versions. This includes some |
80 |
|
experimental, incomplete support for UTF-8 encoded strings. Details of exactly |
81 |
|
what is and what is not supported are given below. |
82 |
|
|
83 |
PCRE has its own native API, which is described in this document. There is also |
PCRE has its own native API, which is described in this document. There is also |
84 |
a set of wrapper functions that correspond to the POSIX regular expression API. |
a set of wrapper functions that correspond to the POSIX regular expression API. |
92 |
use these to include support for different releases. |
use these to include support for different releases. |
93 |
|
|
94 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
95 |
are used for compiling and matching regular expressions, while |
are used for compiling and matching regular expressions. |
96 |
\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
|
97 |
|
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
98 |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
99 |
captured substrings from a matched subject string. The function |
captured substrings from a matched subject string; \fBpcre_free_substring()\fR |
100 |
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables |
and \fBpcre_free_substring_list()\fR are also provided, to free the memory used |
101 |
in the current locale for passing to \fBpcre_compile()\fR. |
for extracted strings. |
102 |
|
|
103 |
|
The function \fBpcre_maketables()\fR is used (optionally) to build a set of |
104 |
|
character tables in the current locale for passing to \fBpcre_compile()\fR. |
105 |
|
|
106 |
The function \fBpcre_fullinfo()\fR is used to find out information about a |
The function \fBpcre_fullinfo()\fR is used to find out information about a |
107 |
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only |
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only |
235 |
greedy by default, but become greedy if followed by "?". It is not compatible |
greedy by default, but become greedy if followed by "?". It is not compatible |
236 |
with Perl. It can also be set by a (?U) option setting within the pattern. |
with Perl. It can also be set by a (?U) option setting within the pattern. |
237 |
|
|
238 |
|
PCRE_UTF8 |
239 |
|
|
240 |
|
This option causes PCRE to regard both the pattern and the subject as strings |
241 |
|
of UTF-8 characters instead of just byte strings. However, it is available only |
242 |
|
if PCRE has been built to include UTF-8 support. If not, the use of this option |
243 |
|
provokes an error. Support for UTF-8 is new, experimental, and incomplete. |
244 |
|
Details of exactly what it entails are given below. |
245 |
|
|
246 |
|
|
247 |
.SH STUDYING A PATTERN |
.SH STUDYING A PATTERN |
248 |
When a pattern is going to be used several times, it is worth spending more |
When a pattern is going to be used several times, it is worth spending more |
578 |
value of zero extracts the substring that matched the entire pattern, while |
value of zero extracts the substring that matched the entire pattern, while |
579 |
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, |
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, |
580 |
the string is placed in \fIbuffer\fR, whose length is given by |
the string is placed in \fIbuffer\fR, whose length is given by |
581 |
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is |
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is |
582 |
obtained via \fBpcre_malloc\fR, and its address is returned via |
obtained via \fBpcre_malloc\fR, and its address is returned via |
583 |
\fIstringptr\fR. The yield of the function is the length of the string, not |
\fIstringptr\fR. The yield of the function is the length of the string, not |
584 |
including the terminating zero, or one of |
including the terminating zero, or one of |
610 |
inspecting the appropriate offset in \fIovector\fR, which is negative for unset |
inspecting the appropriate offset in \fIovector\fR, which is negative for unset |
611 |
substrings. |
substrings. |
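The offset arithmetic described above can be sketched without linking PCRE at all. What follows is a minimal, hypothetical stand-in, not PCRE's own source: the name sketch_copy_substring and the two SKETCH_ERROR_* codes are invented for illustration, and it assumes ovector holds (start, end) byte-offset pairs with negative offsets marking an unset substring, as the text states.

```c
#include <string.h>

/* Hypothetical error codes for this sketch; they are not PCRE's values. */
#define SKETCH_ERROR_NOMEMORY    (-1)  /* buffer too small */
#define SKETCH_ERROR_NOSUBSTRING (-2)  /* no such substring number */

/*
 * Copy captured substring number `stringnumber` from `subject` into `buffer`.
 * `ovector` holds (start, end) byte-offset pairs and `stringcount` is the
 * number of pairs that were set by the match. Returns the length of the
 * copied string, not including the terminating zero, or a negative code.
 */
int sketch_copy_substring(const char *subject, const int *ovector,
                          int stringcount, int stringnumber,
                          char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    if (start < 0) {
        /* Unset substring: yield an empty string. */
        start = 0;
        length = 0;
    } else {
        length = ovector[2 * stringnumber + 1] - start;
    }

    if (buffersize < length + 1)
        return SKETCH_ERROR_NOMEMORY;

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

A caller that needs to tell an unset substring from a genuinely empty one still has to look at the offset in ovector itself, because both cases return a length of zero here.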
612 |
|
|
613 |
|
The two convenience functions \fBpcre_free_substring()\fR and |
614 |
|
\fBpcre_free_substring_list()\fR can be used to free the memory returned by |
615 |
|
a previous call of \fBpcre_get_substring()\fR or |
616 |
|
\fBpcre_get_substring_list()\fR, respectively. They do nothing more than call |
617 |
|
the function pointed to by \fBpcre_free\fR, which of course could be called |
618 |
|
directly from a C program. However, PCRE is used in some situations where it is |
619 |
|
linked via a special interface to another programming language which cannot use |
620 |
|
\fBpcre_free\fR directly; it is for these cases that the functions are |
621 |
|
provided. |
622 |
|
|
623 |
|
|
624 |
.SH LIMITATIONS |
.SH LIMITATIONS |
720 |
described below. Regular expressions are also described in the Perl |
described below. Regular expressions are also described in the Perl |
721 |
documentation and in a number of other books, some of which have copious |
documentation and in a number of other books, some of which have copious |
722 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
723 |
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
724 |
here is intended as reference documentation. |
|
725 |
|
The description here is intended as reference documentation. The basic |
726 |
|
operation of PCRE is on strings of bytes. However, there are the beginnings of |
727 |
|
some support for UTF-8 character strings. To use this support you must |
728 |
|
configure PCRE to include it, and then call \fBpcre_compile()\fR with the |
729 |
|
PCRE_UTF8 option. How this affects the pattern matching is described in the |
730 |
|
final section of this document. |
731 |
|
|
732 |
A regular expression is a pattern that is matched against a subject string from |
A regular expression is a pattern that is matched against a subject string from |
733 |
left to right. Most characters stand for themselves in a pattern, and match the |
left to right. Most characters stand for themselves in a pattern, and match the |
1346 |
|
|
1347 |
(a|b\\1)+ |
(a|b\\1)+ |
1348 |
|
|
1349 |
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of |
1350 |
the subpattern, the back reference matches the character string corresponding |
the subpattern, the back reference matches the character string corresponding |
1351 |
to the previous iteration. In order for this to work, the pattern must be such |
to the previous iteration. In order for this to work, the pattern must be such |
1352 |
that the first iteration does not need to match the back reference. This can be |
that the first iteration does not need to match the back reference. This can be |
1720 |
applied to a whole line of "a" characters, whereas the latter takes an |
applied to a whole line of "a" characters, whereas the latter takes an |
1721 |
appreciable time with strings longer than about 20 characters. |
appreciable time with strings longer than about 20 characters. |
1722 |
|
|
1723 |
|
|
1724 |
|
.SH UTF-8 SUPPORT |
1725 |
|
Starting at release 3.3, PCRE has some support for character strings encoded |
1726 |
|
in the UTF-8 format. This is incomplete, and is regarded as experimental. In |
1727 |
|
order to use it, you must configure PCRE to include UTF-8 support in the code, |
1728 |
|
and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option |
1729 |
|
flag. When you do this, both the pattern and any subject strings that are |
1730 |
|
matched against it are treated as UTF-8 strings instead of just strings of |
1731 |
|
bytes, but only in the cases that are mentioned below. |
1732 |
|
|
1733 |
|
If you compile PCRE with UTF-8 support, but do not use it at run time, the |
1734 |
|
library will be a bit bigger, but the additional run time overhead is limited |
1735 |
|
to testing the PCRE_UTF8 flag in several places, so should not be very large. |
1736 |
|
|
1737 |
|
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does |
1738 |
|
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE, |
1739 |
|
the results are undefined. |
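Because invalid input is not diagnosed, a caller may want to check strings before handing them to PCRE. The following is a minimal sketch of such a check. The function name sketch_utf8_well_formed is hypothetical and is not part of PCRE. It tests only that each lead byte is followed by the right number of continuation bytes, accepting the one- to six-byte forms. It does not reject overlong encodings, so it should be read as an illustration and not as a complete validator.

```c
#include <stddef.h>

/*
 * Return 1 if the `length` bytes at `s` are structurally well-formed UTF-8,
 * that is, every lead byte is followed by the number of 10xxxxxx
 * continuation bytes that it announces. Return 0 otherwise.
 * Overlong encodings are NOT detected by this sketch.
 */
int sketch_utf8_well_formed(const unsigned char *s, size_t length)
{
    size_t i = 0;

    while (i < length) {
        unsigned char c = s[i++];
        int extra;

        if (c < 0x80)      extra = 0;   /* 0xxxxxxx: single byte */
        else if (c < 0xC0) return 0;    /* stray continuation byte */
        else if (c < 0xE0) extra = 1;   /* 110xxxxx */
        else if (c < 0xF0) extra = 2;   /* 1110xxxx */
        else if (c < 0xF8) extra = 3;   /* 11110xxx */
        else if (c < 0xFC) extra = 4;   /* 111110xx */
        else if (c < 0xFE) extra = 5;   /* 1111110x */
        else return 0;                  /* 0xFE and 0xFF are never lead bytes */

        while (extra-- > 0) {
            /* Each continuation byte must exist and look like 10xxxxxx. */
            if (i >= length || (s[i] & 0xC0) != 0x80)
                return 0;
            i++;
        }
    }
    return 1;
}
```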
1740 |
|
|
1741 |
|
Running with PCRE_UTF8 set causes these changes in the way PCRE works: |
1742 |
|
|
1743 |
|
1. In a pattern, the escape sequence \\x{...}, where the contents of the braces |
1744 |
|
are a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
1745 |
|
code number is the given hexadecimal number, for example: \\x{1234}. This |
1746 |
|
inserts from one to six literal bytes into the pattern, using the UTF-8 |
1747 |
|
encoding. If a non-hexadecimal digit appears between the braces, the item is |
1748 |
|
not recognized. |
1749 |
|
|
1750 |
|
2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8 |
1751 |
|
character if its value is greater than 127. |
1752 |
|
|
1753 |
|
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte |
1754 |
|
character. For example, \\x{100}* and \\xc3+ do not work. If you want to |
1755 |
|
repeat such characters, you must enclose them in non-capturing parentheses, |
1756 |
|
for example (?:\\x{100}), at present. |
1757 |
|
|
1758 |
|
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
1759 |
|
|
1760 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a |
1761 |
|
repeat quantifier does operate correctly on UTF-8 characters instead of |
1762 |
|
single bytes. |
1763 |
|
|
1764 |
|
6. Although the \\x{...} escape is permitted in a character class, characters |
1765 |
|
whose values are greater than 255 cannot be included in a class. |
1766 |
|
|
1767 |
|
7. A class is matched against a UTF-8 character instead of just a single byte, |
1768 |
|
but it can match only characters whose values are less than 256. Characters |
1769 |
|
with greater values always fail to match a class. |
1770 |
|
|
1771 |
|
8. Repeated classes work correctly on multiple characters. |
1772 |
|
|
1773 |
|
9. Classes containing just a single character whose value is greater than 127 |
1774 |
|
(but less than 256), for example, [\\x80] or [^\\x{93}], do not work because |
1775 |
|
these are optimized into single byte matches. In the first case, of course, |
1776 |
|
the class brackets are just redundant. |
1777 |
|
|
1778 |
|
10. Lookbehind assertions move backwards in the subject by a fixed number of |
1779 |
|
characters instead of a fixed number of bytes. Simple cases have been tested |
1780 |
|
to work correctly, but there may be hidden gotchas herein. |
1781 |
|
|
1782 |
|
11. The character types such as \\d and \\w do not work correctly with UTF-8 |
1783 |
|
characters. They continue to test a single byte. |
1784 |
|
|
1785 |
|
12. Anything not explicitly mentioned here continues to work in bytes rather |
1786 |
|
than in characters. |
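To make the "one to six literal bytes" in item 1 concrete, here is a minimal sketch of encoding a code number as UTF-8. It is an illustration written for this description, not PCRE's own code, and the name sketch_utf8_encode is hypothetical. It assumes the original UTF-8 scheme, in which values up to 0x7fffffff take at most six bytes.

```c
/*
 * Encode code number `cp` as UTF-8 into `out`, which must have room for
 * six bytes. Returns the number of bytes written (1 to 6), or 0 if `cp`
 * is larger than 0x7fffffff and so cannot be encoded in this scheme.
 */
int sketch_utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest value that fits in 1, 2, 3, 4, 5 and 6 bytes. */
    static const unsigned long limits[] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits of the lead byte for each encoded length. */
    static const unsigned char lead[] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    int extra = 0;
    int i;

    if (cp > 0x7FFFFFFFUL)
        return 0;

    /* Find how many continuation bytes are needed. */
    while (extra < 5 && cp > limits[extra])
        extra++;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = extra; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[extra] | cp);
    return extra + 1;
}
```

With this sketch, the value 0x1234 from the example in item 1 becomes the three bytes E1 88 B4, and the value 0xc3 becomes the two bytes C3 83, which matches the statement in item 2 that values greater than 127 generate a two-byte character.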
1787 |
|
|
1788 |
|
The following UTF-8 features of Perl 5.6 are not implemented: |
1789 |
|
|
1790 |
|
1. The escape sequence \\C to match a single byte. |
1791 |
|
|
1792 |
|
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
1793 |
|
|
1794 |
.SH AUTHOR |
.SH AUTHOR |
1795 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
1796 |
.br |
.br |
1802 |
.br |
.br |
1803 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
1804 |
|
|
1805 |
Last updated: 27 January 2000 |
Last updated: 28 August 2000, |
1806 |
|
.br |
1807 |
|
the 250th anniversary of the death of J.S. Bach. |
1808 |
.br |
.br |
1809 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |