221 |
.SH NEWLINES |
.SH NEWLINES |
222 |
.rs |
.rs |
223 |
.sp |
.sp |
224 |
PCRE supports four different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
225 |
strings: a single CR (carriage return) character, a single LF (linefeed) |
strings: a single CR (carriage return) character, a single LF (linefeed) |
226 |
character, the two-character sequence CRLF, or any Unicode newline sequence. |
character, the two-character sequence CRLF, any of the three preceding, or any |
227 |
The Unicode newline sequences are the three just mentioned, plus the single |
Unicode newline sequence. The Unicode newline sequences are the three just |
228 |
characters VT (vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, |
mentioned, plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
229 |
U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029). |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
230 |
|
(paragraph separator, U+2029). |
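The five conventions can be illustrated with a minimal sketch that reports how many bytes a newline occupies at a given position in a byte string. This is not PCRE's internal code: the enum and function names are hypothetical, and LS and PS are left out because they are recognized only in UTF-8 mode and would need multi-byte decoding.

```c
#include <stddef.h>

/* Hypothetical names for the five conventions; these are not PCRE identifiers. */
enum nl_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the newline that starts at s[i], or 0 if
   there is none. Byte strings only: LS (U+2028) and PS (U+2029) are not
   handled because they require UTF-8 decoding. */
size_t newline_len(const unsigned char *s, size_t len, size_t i, enum nl_mode mode)
{
    if (i >= len) return 0;

    unsigned char c = s[i];
    int crlf = (c == '\r' && i + 1 < len && s[i + 1] == '\n');

    switch (mode) {
    case NL_CR:
        return c == '\r' ? 1 : 0;
    case NL_LF:
        return c == '\n' ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
        /* Any of the three preceding conventions. */
        if (crlf) return 2;
        return (c == '\r' || c == '\n') ? 1 : 0;
    case NL_ANY:
        /* The three above plus VT (0x0B), FF (0x0C), and NEL (0x85). */
        if (crlf) return 2;
        return (c == '\r' || c == '\n' || c == 0x0B || c == 0x0C || c == 0x85) ? 1 : 0;
    }
    return 0;
}
```

Note that under the CR or LF convention a CRLF pair is not a single newline: only the matching character counts, and the other one is ordinary data.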
231 |
.P |
.P |
232 |
Each of the first three conventions is used by at least one operating system as |
Each of the first three conventions is used by at least one operating system as |
233 |
its standard newline sequence. When PCRE is built, a default can be specified. |
its standard newline sequence. When PCRE is built, a default can be specified. |
265 |
.\" HREF |
.\" HREF |
266 |
\fBpcreprecompile\fP |
\fBpcreprecompile\fP |
267 |
.\" |
.\" |
268 |
documentation. |
documentation. However, compiling a regular expression with one version of PCRE |
269 |
|
for use with a different version is not guaranteed to work and may cause |
270 |
|
crashes. |
271 |
. |
. |
272 |
. |
. |
273 |
.SH "CHECKING BUILD-TIME OPTIONS" |
.SH "CHECKING BUILD-TIME OPTIONS" |
300 |
.sp |
.sp |
301 |
The output is an integer whose value specifies the default character sequence |
The output is an integer whose value specifies the default character sequence |
302 |
that is recognized as meaning "newline". The four values that are supported |
that is recognized as meaning "newline". The five values that are supported |
303 |
are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY. The default should |
are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, and -1 for ANY. The |
304 |
normally be the standard sequence for your operating system. |
default should normally be the standard sequence for your operating system. |
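As an illustration, the integer values listed above could be turned into readable names with a small helper such as the following. The function name is hypothetical and is not part of PCRE; only the numeric values come from the text.

```c
#include <stddef.h>

/* Map the integer reported for the default newline setting to a name.
   Hypothetical helper, not a PCRE function. Returns NULL for a value
   that is not one of the documented ones. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";     /* 13 * 256 + 10, that is, CR then LF */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return NULL;
    }
}
```

In a program linked with PCRE, the value would normally be obtained with a call of the form pcre_config(PCRE_CONFIG_NEWLINE, &value), where value is an int, and then passed to a helper like this one.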
305 |
.sp |
.sp |
306 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
307 |
.sp |
.sp |
535 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
536 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
537 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
538 |
|
PCRE_NEWLINE_ANYCRLF |
539 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
540 |
.sp |
.sp |
541 |
These options override the default newline definition that was chosen when PCRE |
These options override the default newline definition that was chosen when PCRE |
542 |
was built. Setting the first or the second specifies that a newline is |
was built. Setting the first or the second specifies that a newline is |
543 |
indicated by a single character (CR or LF, respectively). Setting |
indicated by a single character (CR or LF, respectively). Setting |
544 |
PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character |
PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character |
545 |
CRLF sequence. Setting PCRE_NEWLINE_ANY specifies that any Unicode newline |
CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies that any of the three |
546 |
sequence should be recognized. The Unicode newline sequences are the three just |
preceding sequences should be recognized. Setting PCRE_NEWLINE_ANY specifies |
547 |
mentioned, plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
that any Unicode newline sequence should be recognized. The Unicode newline |
548 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
sequences are the three just mentioned, plus the single characters VT (vertical |
549 |
(paragraph separator, U+2029). The last two are recognized only in UTF-8 mode. |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
550 |
|
separator, U+2028), and PS (paragraph separator, U+2029). The last two are |
551 |
|
recognized only in UTF-8 mode. |
552 |
.P |
.P |
553 |
The newline setting in the options word uses three bits that are treated |
The newline setting in the options word uses three bits that are treated |
554 |
as a number, giving eight possibilities. Currently only five are used (default |
as a number, giving eight possibilities. Currently only six are used (default |
555 |
plus the four values above). This means that if you set more than one newline |
plus the five values above). This means that if you set more than one newline |
556 |
option, the combination may or may not be sensible. For example, |
option, the combination may or may not be sensible. For example, |
557 |
PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but |
PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but |
558 |
other combinations yield unused numbers and cause an error. |
other combinations may yield unused numbers and cause an error. |
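The effect of combining newline options can be sketched as follows. The EX_ constants are illustrative copies that are assumed to mirror the three-bit field defined in pcre.h; check them against the header of the release you are using. The validity function is hypothetical.

```c
/* Illustrative copies of the newline option values. They are assumed to
   match pcre.h; verify against your own header before relying on them. */
#define EX_NEWLINE_CR      0x00100000
#define EX_NEWLINE_LF      0x00200000
#define EX_NEWLINE_CRLF    0x00300000
#define EX_NEWLINE_ANY     0x00400000
#define EX_NEWLINE_ANYCRLF 0x00500000
#define EX_NEWLINE_BITS    0x00700000   /* the three bits treated as a number */

/* Return 1 if the newline bits in options select one of the six settings
   in use (0 means "use the build-time default"), otherwise 0. */
int newline_bits_valid(int options)
{
    switch (options & EX_NEWLINE_BITS) {
    case 0:
    case EX_NEWLINE_CR:
    case EX_NEWLINE_LF:
    case EX_NEWLINE_CRLF:
    case EX_NEWLINE_ANY:
    case EX_NEWLINE_ANYCRLF:
        return 1;
    default:
        return 0;   /* an unused number */
    }
}
```

With these values, CR combined with LF gives the CRLF number, and CR combined with ANY happens to give the ANYCRLF number, whereas LF combined with ANY gives a number that is not in use.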
559 |
.P |
.P |
560 |
The only time that a line break is specially recognized when compiling a |
The only time that a line break is specially recognized when compiling a |
561 |
pattern is if PCRE_EXTENDED is set, and an unescaped # outside a character |
pattern is if PCRE_EXTENDED is set, and an unescaped # outside a character |
735 |
.SH "LOCALE SUPPORT" |
.SH "LOCALE SUPPORT" |
736 |
.rs |
.rs |
737 |
.sp |
.sp |
738 |
PCRE handles caseless matching, and determines whether characters are letters |
PCRE handles caseless matching, and determines whether characters are letters, |
739 |
digits, or whatever, by reference to a set of tables, indexed by character |
digits, or whatever, by reference to a set of tables, indexed by character |
740 |
value. When running in UTF-8 mode, this applies only to characters with codes |
value. When running in UTF-8 mode, this applies only to characters with codes |
741 |
less than 128. Higher-valued codes never match escapes such as \ew or \ed, but |
less than 128. Higher-valued codes never match escapes such as \ew or \ed, but |
742 |
can be tested with \ep if PCRE is built with Unicode character property |
can be tested with \ep if PCRE is built with Unicode character property |
743 |
support. The use of locales with Unicode is discouraged. |
support. The use of locales with Unicode is discouraged. If you are handling |
744 |
.P |
characters with codes greater than 127, you should either use UTF-8 and |
745 |
An internal set of tables is created in the default C locale when PCRE is |
Unicode, or use locales, but not try to mix the two. |
746 |
built. This is used when the final argument of \fBpcre_compile()\fP is NULL, |
.P |
747 |
and is sufficient for many applications. An alternative set of tables can, |
PCRE contains an internal set of tables that are used when the final argument |
748 |
however, be supplied. These may be created in a different locale from the |
of \fBpcre_compile()\fP is NULL. These are sufficient for many applications. |
749 |
default. As more and more applications change to using Unicode, the need for |
Normally, the internal tables recognize only ASCII characters. However, when |
750 |
this locale support is expected to die away. |
PCRE is built, it is possible to cause the internal tables to be rebuilt in the |
751 |
|
default "C" locale of the local system, which may cause them to be different. |
752 |
|
.P |
753 |
|
The internal tables can always be overridden by tables supplied by the |
754 |
|
application that calls PCRE. These may be created in a different locale from |
755 |
|
the default. As more and more applications change to using Unicode, the need |
756 |
|
for this locale support is expected to die away. |
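The idea of character tables indexed by character value, built in whatever locale is current, can be sketched with the standard C classification functions. This is only an illustration: the layout of PCRE's real tables is different, and the names below are hypothetical.

```c
#include <ctype.h>

/* Hypothetical flag bits for this sketch; PCRE's own tables are laid out
   differently. */
#define EX_LETTER 0x01
#define EX_DIGIT  0x02

/* Fill a 256-entry table, indexed by character value, using the
   classification rules of the current LC_CTYPE locale. */
void build_class_table(unsigned char table[256])
{
    for (int c = 0; c < 256; c++) {
        table[c] = 0;
        if (isalpha(c)) table[c] |= EX_LETTER;
        if (isdigit(c)) table[c] |= EX_DIGIT;
    }
}
```

Calling setlocale(LC_CTYPE, ...) before build_class_table() changes which values above 127 are classified as letters, which is the reason tables built in different locales can differ.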
757 |
.P |
.P |
758 |
External tables are built by calling the \fBpcre_maketables()\fP function, |
External tables are built by calling the \fBpcre_maketables()\fP function, |
759 |
which has no arguments, in the relevant locale. The result can then be passed |
which has no arguments, in the relevant locale. The result can then be passed |
766 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
767 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
768 |
.sp |
.sp |
769 |
|
The locale name "fr_FR" is used on Linux and other Unix-like systems; if you |
770 |
|
are using Windows, the name for the French locale is "french". |
771 |
|
.P |
772 |
When \fBpcre_maketables()\fP runs, the tables are built in memory that is |
When \fBpcre_maketables()\fP runs, the tables are built in memory that is |
773 |
obtained via \fBpcre_malloc\fP. It is the caller's responsibility to ensure |
obtained via \fBpcre_malloc\fP. It is the caller's responsibility to ensure |
774 |
that the memory containing the tables remains available for as long as it is |
that the memory containing the tables remains available for as long as it is |
1156 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1157 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
1158 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
1159 |
|
PCRE_NEWLINE_ANYCRLF |
1160 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1161 |
.sp |
.sp |
1162 |
These options override the newline definition that was chosen or defaulted when |
These options override the newline definition that was chosen or defaulted when |
1164 |
\fBpcre_compile()\fP above. During matching, the newline choice affects the |
\fBpcre_compile()\fP above. During matching, the newline choice affects the |
1165 |
behaviour of the dot, circumflex, and dollar metacharacters. It may also alter |
behaviour of the dot, circumflex, and dollar metacharacters. It may also alter |
1166 |
the way the match position is advanced after a match failure for an unanchored |
the way the match position is advanced after a match failure for an unanchored |
1167 |
pattern. When PCRE_NEWLINE_CRLF or PCRE_NEWLINE_ANY is set, and a match attempt |
pattern. When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
1168 |
fails when the current position is at a CRLF sequence, the match position is |
set, and a match attempt fails when the current position is at a CRLF sequence, |
1169 |
advanced by two characters instead of one, in other words, to after the CRLF. |
the match position is advanced by two characters instead of one, in other |
1170 |
|
words, to after the CRLF. |
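The advance rule described above can be sketched as a small function. This is a simplified illustration, not PCRE's matching loop; the function name and the crlf_is_newline flag are hypothetical, the flag standing for "PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is in force".

```c
#include <stddef.h>

/* Return the position at which the next match attempt starts after a
   failure at pos. When CRLF counts as a newline and the current position
   is at a CRLF sequence, skip both characters; otherwise advance by one. */
size_t next_start(const char *subject, size_t len, size_t pos, int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < len &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;
    return pos + 1;
}
```

The sketch counts bytes, so it ignores multi-byte characters in UTF-8 mode. Neither character of a CRLF pair is multi-byte, so the two-character skip itself is unaffected.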
1171 |
.sp |
.sp |
1172 |
PCRE_NOTBOL |
PCRE_NOTBOL |
1173 |
.sp |
.sp |
1604 |
These functions call \fBpcre_get_stringnumber()\fP, and if it succeeds, they |
These functions call \fBpcre_get_stringnumber()\fP, and if it succeeds, they |
1605 |
then call \fBpcre_copy_substring()\fP or \fBpcre_get_substring()\fP, as |
then call \fBpcre_copy_substring()\fP or \fBpcre_get_substring()\fP, as |
1606 |
appropriate. \fBNOTE:\fP If PCRE_DUPNAMES is set and there are duplicate names, |
appropriate. \fBNOTE:\fP If PCRE_DUPNAMES is set and there are duplicate names, |
1607 |
the behaviour may not be what you want (see the next section). |
the behaviour may not be what you want (see the next section). |
1608 |
. |
. |
1609 |
. |
. |
1610 |
.SH "DUPLICATE SUBPATTERN NAMES" |
.SH "DUPLICATE SUBPATTERN NAMES" |
1851 |
.rs |
.rs |
1852 |
.sp |
.sp |
1853 |
.nf |
.nf |
1854 |
Last updated: 06 March 2007 |
Last updated: 24 April 2007 |
1855 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
1856 |
.fi |
.fi |