2911 |
The newline convention does not affect what the \R escape sequence |
The newline convention does not affect what the \R escape sequence |
2912 |
matches. By default, this is any Unicode newline sequence, for Perl |
matches. By default, this is any Unicode newline sequence, for Perl |
2913 |
compatibility. However, this can be changed; see the description of \R |
compatibility. However, this can be changed; see the description of \R |
2914 |
in the section entitled "Newline sequences" below. |
in the section entitled "Newline sequences" below. A change of \R set- |
2915 |
|
ting can be combined with a change of newline convention. |
2916 |
|
|
2917 |
|
|
2918 |
CHARACTERS AND METACHARACTERS |
CHARACTERS AND METACHARACTERS |
2919 |
|
|
2920 |
A regular expression is a pattern that is matched against a subject |
A regular expression is a pattern that is matched against a subject |
2921 |
string from left to right. Most characters stand for themselves in a |
string from left to right. Most characters stand for themselves in a |
2922 |
pattern, and match the corresponding characters in the subject. As a |
pattern, and match the corresponding characters in the subject. As a |
2923 |
trivial example, the pattern |
trivial example, the pattern |
2924 |
|
|
2925 |
The quick brown fox |
The quick brown fox |
2926 |
|
|
2927 |
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
2928 |
caseless matching is specified (the PCRE_CASELESS option), letters are |
caseless matching is specified (the PCRE_CASELESS option), letters are |
2929 |
matched independently of case. In UTF-8 mode, PCRE always understands |
matched independently of case. In UTF-8 mode, PCRE always understands |
2930 |
the concept of case for characters whose values are less than 128, so |
the concept of case for characters whose values are less than 128, so |
2931 |
caseless matching is always possible. For characters with higher val- |
caseless matching is always possible. For characters with higher val- |
2932 |
ues, the concept of case is supported if PCRE is compiled with Unicode |
ues, the concept of case is supported if PCRE is compiled with Unicode |
2933 |
property support, but not otherwise. If you want to use caseless |
property support, but not otherwise. If you want to use caseless |
2934 |
matching for characters 128 and above, you must ensure that PCRE is |
matching for characters 128 and above, you must ensure that PCRE is |
2935 |
compiled with Unicode property support as well as with UTF-8 support. |
compiled with Unicode property support as well as with UTF-8 support. |
2936 |
|
|
2937 |
The power of regular expressions comes from the ability to include |
The power of regular expressions comes from the ability to include |
2938 |
alternatives and repetitions in the pattern. These are encoded in the |
alternatives and repetitions in the pattern. These are encoded in the |
2939 |
pattern by the use of metacharacters, which do not stand for themselves |
pattern by the use of metacharacters, which do not stand for themselves |
2940 |
but instead are interpreted in some special way. |
but instead are interpreted in some special way. |
2941 |
|
|
2942 |
There are two different sets of metacharacters: those that are recog- |
There are two different sets of metacharacters: those that are recog- |
2943 |
nized anywhere in the pattern except within square brackets, and those |
nized anywhere in the pattern except within square brackets, and those |
2944 |
that are recognized within square brackets. Outside square brackets, |
that are recognized within square brackets. Outside square brackets, |
2945 |
the metacharacters are as follows: |
the metacharacters are as follows: |
2946 |
|
|
2947 |
\ general escape character with several uses |
\ general escape character with several uses |
2960 |
also "possessive quantifier" |
also "possessive quantifier" |
2961 |
{ start min/max quantifier |
{ start min/max quantifier |
2962 |
|
|
2963 |
Part of a pattern that is in square brackets is called a "character |
Part of a pattern that is in square brackets is called a "character |
2964 |
class". In a character class the only metacharacters are: |
class". In a character class the only metacharacters are: |
2965 |
|
|
2966 |
\ general escape character |
\ general escape character |
2970 |
syntax) |
syntax) |
2971 |
] terminates the character class |
] terminates the character class |
2972 |
|
|
2973 |
The following sections describe the use of each of the metacharacters. |
The following sections describe the use of each of the metacharacters. |
2974 |
|
|
2975 |
|
|
2976 |
BACKSLASH |
BACKSLASH |
2977 |
|
|
2978 |
The backslash character has several uses. Firstly, if it is followed by |
The backslash character has several uses. Firstly, if it is followed by |
2979 |
a non-alphanumeric character, it takes away any special meaning that |
a non-alphanumeric character, it takes away any special meaning that |
2980 |
character may have. This use of backslash as an escape character |
character may have. This use of backslash as an escape character |
2981 |
applies both inside and outside character classes. |
applies both inside and outside character classes. |
2982 |
|
|
2983 |
For example, if you want to match a * character, you write \* in the |
For example, if you want to match a * character, you write \* in the |
2984 |
pattern. This escaping action applies whether or not the following |
pattern. This escaping action applies whether or not the following |
2985 |
character would otherwise be interpreted as a metacharacter, so it is |
character would otherwise be interpreted as a metacharacter, so it is |
2986 |
always safe to precede a non-alphanumeric with backslash to specify |
always safe to precede a non-alphanumeric with backslash to specify |
2987 |
that it stands for itself. In particular, if you want to match a back- |
that it stands for itself. In particular, if you want to match a back- |
2988 |
slash, you write \\. |
slash, you write \\. |
2989 |
|
|
2990 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
2991 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
2992 |
# outside a character class and the next newline are ignored. An escap- |
# outside a character class and the next newline are ignored. An escap- |
2993 |
ing backslash can be used to include a whitespace or # character as |
ing backslash can be used to include a whitespace or # character as |
2994 |
part of the pattern. |
part of the pattern. |
2995 |
|
|
2996 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
2997 |
ters, you can do so by putting them between \Q and \E. This is differ- |
ters, you can do so by putting them between \Q and \E. This is differ- |
2998 |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
2999 |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
3000 |
tion. Note the following examples: |
tion. Note the following examples: |
3001 |
|
|
3002 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
3006 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
3007 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
3008 |
|
|
3009 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
3010 |
classes. |
classes. |
3011 |
|
|
3012 |
Non-printing characters |
Non-printing characters |
3013 |
|
|
3014 |
A second use of backslash provides a way of encoding non-printing char- |
A second use of backslash provides a way of encoding non-printing char- |
3015 |
acters in patterns in a visible manner. There is no restriction on the |
acters in patterns in a visible manner. There is no restriction on the |
3016 |
appearance of non-printing characters, apart from the binary zero that |
appearance of non-printing characters, apart from the binary zero that |
3017 |
terminates a pattern, but when a pattern is being prepared by text |
terminates a pattern, but when a pattern is being prepared by text |
3018 |
editing, it is usually easier to use one of the following escape |
editing, it is usually easier to use one of the following escape |
3019 |
sequences than the binary character it represents: |
sequences than the binary character it represents: |
3020 |
|
|
3021 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
3029 |
\xhh character with hex code hh |
\xhh character with hex code hh |
3030 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
3031 |
|
|
3032 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
3033 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
3034 |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
3035 |
becomes hex 7B. |
becomes hex 7B. |
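
The \cx rule above is plain bit arithmetic, so it can be checked with a short sketch. The function name control_escape is hypothetical, and restricting the input to a single ASCII character is an assumption made for this illustration, not something the text states.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx stands for, following the rule above."""
    # Assumption for this sketch: x is exactly one ASCII character.
    if len(x) != 1 or ord(x) > 127:
        raise ValueError("expected a single ASCII character")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return ord(x) ^ 0x40


print(hex(control_escape("z")))  # 0x1a
print(hex(control_escape("{")))  # 0x3b
print(hex(control_escape(";")))  # 0x7b
```

The three printed values correspond to the \cz, \c{ and \c; cases quoted in the text.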
3036 |
|
|
3037 |
After \x, from zero to two hexadecimal digits are read (letters can be |
After \x, from zero to two hexadecimal digits are read (letters can be |
3038 |
in upper or lower case). Any number of hexadecimal digits may appear |
in upper or lower case). Any number of hexadecimal digits may appear |
3039 |
between \x{ and }, but the value of the character code must be less |
between \x{ and }, but the value of the character code must be less |
3040 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
3041 |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
3042 |
than the largest Unicode code point, which is 10FFFF. |
than the largest Unicode code point, which is 10FFFF. |
3043 |
|
|
3044 |
If characters other than hexadecimal digits appear between \x{ and }, |
If characters other than hexadecimal digits appear between \x{ and }, |
3045 |
or if there is no terminating }, this form of escape is not recognized. |
or if there is no terminating }, this form of escape is not recognized. |
3046 |
Instead, the initial \x will be interpreted as a basic hexadecimal |
Instead, the initial \x will be interpreted as a basic hexadecimal |
3047 |
escape, with no following digits, giving a character whose value is |
escape, with no following digits, giving a character whose value is |
3048 |
zero. |
zero. |
3049 |
|
|
3050 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
3051 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x. There is no difference in the way they are han- |
3052 |
dled. For example, \xdc is exactly the same as \x{dc}. |
dled. For example, \xdc is exactly the same as \x{dc}. |
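
The reading rules for \x described above can be sketched as a small function. This is an illustration only, not PCRE's parser: the name parse_x_escape and its return shape are hypothetical, the range limits (256, or 2**31 in UTF-8 mode) are not checked, and treating an empty \x{} as "not recognized" is an assumption that the text does not settle.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_x_escape(rest: str) -> tuple[int, int]:
    """Interpret the text that follows a backslash-x.

    Returns (character code, number of characters of `rest` consumed).
    """
    if rest.startswith("{"):
        end = rest.find("}")
        body = rest[1:end] if end != -1 else None
        # Brace form: only hexadecimal digits, and a terminating }.
        if body and all(ch in HEX_DIGITS for ch in body):
            return int(body, 16), end + 1
        # Not recognized: \x is a basic escape with no digits, value zero.
        # The "{" is left in place as ordinary pattern text.
        return 0, 0
    # Basic form: from zero to two hexadecimal digits are read.
    count = 0
    while count < 2 and count < len(rest) and rest[count] in HEX_DIGITS:
        count += 1
    code = int(rest[:count], 16) if count else 0
    return code, count


print(parse_x_escape("dc"))    # (220, 2)
print(parse_x_escape("{dc}"))  # (220, 4)
print(parse_x_escape("{zz}"))  # (0, 0)
```

The first two calls show that \xdc and \x{dc} give the same character code; the third shows the fallback to a zero-valued character when the braces contain something other than hexadecimal digits.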
3053 |
|
|
3054 |
After \0 up to two further octal digits are read. If there are fewer |
After \0 up to two further octal digits are read. If there are fewer |
3055 |
than two digits, just those that are present are used. Thus the |
than two digits, just those that are present are used. Thus the |
3056 |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
3057 |
(code value 7). Make sure you supply two digits after the initial zero |
(code value 7). Make sure you supply two digits after the initial zero |
3058 |
if the pattern character that follows is itself an octal digit. |
if the pattern character that follows is itself an octal digit. |
3059 |
|
|
3060 |
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
3061 |
cated. Outside a character class, PCRE reads it and any following dig- |
cated. Outside a character class, PCRE reads it and any following dig- |
3062 |
its as a decimal number. If the number is less than 10, or if there |
its as a decimal number. If the number is less than 10, or if there |
3063 |
have been at least that many previous capturing left parentheses in the |
have been at least that many previous capturing left parentheses in the |
3064 |
expression, the entire sequence is taken as a back reference. A |
expression, the entire sequence is taken as a back reference. A |
3065 |
description of how this works is given later, following the discussion |
description of how this works is given later, following the discussion |
3066 |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
3067 |
|
|
3068 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
3069 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
3070 |
up to three octal digits following the backslash, and uses them to gen- |
up to three octal digits following the backslash, and uses them to gen- |
3071 |
erate a data character. Any subsequent digits stand for themselves. In |
erate a data character. Any subsequent digits stand for themselves. In |
3072 |
non-UTF-8 mode, the value of a character specified in octal must be |
non-UTF-8 mode, the value of a character specified in octal must be |
3073 |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
3074 |
example: |
example: |
3075 |
|
|
3076 |
\040 is another way of writing a space |
\040 is another way of writing a space |
3088 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
3089 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
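
The decision between a back reference and an octal character can be sketched as follows. This is a hedged illustration of the rules stated above, not PCRE's implementation: the function name, its parameters and its return tuples are hypothetical, and the upper limits on octal values (\400 or \777) are not enforced.

```python
def interpret_backslash_digits(digits: str, groups_so_far: int,
                               in_class: bool = False) -> tuple:
    """Classify a backslash followed by `digits` (first digit not 0).

    Returns ("backref", number, "") or ("char", code, leftover_digits).
    """
    if not digits or digits[0] == "0" or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected decimal digits starting with 1-9")
    if not in_class:
        # Outside a class the digits are first read as a decimal number.
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    code = int(octal, 8) if octal else 0
    # Any digits not consumed stand for themselves.
    return ("char", code, digits[len(octal):])


print(interpret_backslash_digits("81", groups_so_far=0))   # ('char', 0, '81')
print(interpret_backslash_digits("11", groups_so_far=11))  # ('backref', 11, '')
print(interpret_backslash_digits("11", groups_so_far=0))   # ('char', 9, '')
```

The first call reproduces the \81 case from the list: with fewer than 81 capturing groups it yields a binary zero followed by the characters "8" and "1".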
3090 |
|
|
3091 |
Note that octal values of 100 or greater must not be introduced by a |
Note that octal values of 100 or greater must not be introduced by a |
3092 |
leading zero, because no more than three octal digits are ever read. |
leading zero, because no more than three octal digits are ever read. |
3093 |
|
|
3094 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
3095 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
3096 |
class, the sequence \b is interpreted as the backspace character (hex |
class, the sequence \b is interpreted as the backspace character (hex |
3097 |
08), and the sequences \R and \X are interpreted as the characters "R" |
08), and the sequences \R and \X are interpreted as the characters "R" |
3098 |
and "X", respectively. Outside a character class, these sequences have |
and "X", respectively. Outside a character class, these sequences have |
3099 |
different meanings (see below). |
different meanings (see below). |
3100 |
|
|
3101 |
Absolute and relative back references |
Absolute and relative back references |
3102 |
|
|
3103 |
The sequence \g followed by an unsigned or a negative number, option- |
The sequence \g followed by an unsigned or a negative number, option- |
3104 |
ally enclosed in braces, is an absolute or relative back reference. A |
ally enclosed in braces, is an absolute or relative back reference. A |
3105 |
named back reference can be coded as \g{name}. Back references are dis- |
named back reference can be coded as \g{name}. Back references are dis- |
3106 |
cussed later, following the discussion of parenthesized subpatterns. |
cussed later, following the discussion of parenthesized subpatterns. |
3107 |
|
|
3122 |
\W any "non-word" character |
\W any "non-word" character |
3123 |
|
|
3124 |
Each pair of escape sequences partitions the complete set of characters |
Each pair of escape sequences partitions the complete set of characters |
3125 |
into two disjoint sets. Any given character matches one, and only one, |
into two disjoint sets. Any given character matches one, and only one, |
3126 |
of each pair. |
of each pair. |
3127 |
|
|
3128 |
These character type sequences can appear both inside and outside char- |
These character type sequences can appear both inside and outside char- |
3129 |
acter classes. They each match one character of the appropriate type. |
acter classes. They each match one character of the appropriate type. |
3130 |
If the current matching point is at the end of the subject string, all |
If the current matching point is at the end of the subject string, all |
3131 |
of them fail, since there is no character to match. |
of them fail, since there is no character to match. |
3132 |
|
|
3133 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
3134 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s
3135 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
3136 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
3137 |
ter. In PCRE, it never does. |
ter. In PCRE, it never does. |
3138 |
|
|
3139 |
In UTF-8 mode, characters with values greater than 128 never match \d, |
In UTF-8 mode, characters with values greater than 128 never match \d, |
3140 |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
3141 |
code character property support is available. These sequences retain |
code character property support is available. These sequences retain |
3142 |
their original meanings from before UTF-8 support was available, mainly |
their original meanings from before UTF-8 support was available, mainly |
3143 |
for efficiency reasons. |
for efficiency reasons. |
3144 |
|
|
3145 |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
3146 |
the other sequences, these do match certain high-valued codepoints in |
the other sequences, these do match certain high-valued codepoints in |
3147 |
UTF-8 mode. The horizontal space characters are: |
UTF-8 mode. The horizontal space characters are: |
3148 |
|
|
3149 |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
3177 |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
3178 |
|
|
3179 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
3180 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
3181 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
3182 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
3183 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
3184 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
3185 |
are used for accented letters, and these are matched by \w. The use of |
are used for accented letters, and these are matched by \w. The use of |
3186 |
locales with Unicode is discouraged. |
locales with Unicode is discouraged. |
3187 |
|
|
3188 |
Newline sequences |
Newline sequences |
3189 |
|
|
3190 |
Outside a character class, by default, the escape sequence \R matches |
Outside a character class, by default, the escape sequence \R matches |
3191 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
3192 |
mode \R is equivalent to the following: |
mode \R is equivalent to the following: |
3193 |
|
|
3194 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
3195 |
|
|
3196 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
3197 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
3198 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
3199 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
3200 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
3201 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
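
As a rough illustration of the non-UTF-8 alternation above, the sketch below uses Python's re module, which is not PCRE. It is an approximation under stated assumptions: a non-capturing group (?:...) stands in for the atomic group (?>...), so backtracking behaviour can differ (for example, a following \n could still split a CRLF pair here, which the atomic group prevents). The names NEWLINE and split_lines are hypothetical.

```python
import re

# Same alternatives, in the same order, as the group shown above.
# CRLF is listed first so that it is consumed as one unit when splitting.
NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")


def split_lines(text: str) -> list[str]:
    """Split text at any of the newline sequences listed above."""
    return NEWLINE.split(text)


print(split_lines("a\r\nb\nc\x85d"))  # ['a', 'b', 'c', 'd']
```

Because the CRLF alternative comes first, "a\r\nb" splits into two fields rather than three.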
3202 |
|
|
3203 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
3204 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
3205 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
3206 |
these characters to be recognized. |
these characters to be recognized. |
3207 |
|
|
3208 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
3209 |
the complete set of Unicode line endings) by setting the option |
the complete set of Unicode line endings) by setting the option |
3210 |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
3211 |
This can be made the default when PCRE is built; if this is the case, |
(BSR is an abbreviation for "backslash R".) This can be made the default
3212 |
the other behaviour can be requested via the PCRE_BSR_UNICODE option. |
when PCRE is built; if this is the case, the other behaviour can be |
3213 |
It is also possible to specify these settings by starting a pattern |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
3214 |
string with one of the following sequences: |
specify these settings by starting a pattern string with one of the |
3215 |
|
following sequences: |
3216 |
|
|
3217 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
3218 |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
3221 |
they can be overridden by options given to pcre_exec(). Note that these |
they can be overridden by options given to pcre_exec(). Note that these |
3222 |
special settings, which are not Perl-compatible, are recognized only at |
special settings, which are not Perl-compatible, are recognized only at |
3223 |
the very start of a pattern, and that they must be in upper case. If |
the very start of a pattern, and that they must be in upper case. If |
3224 |
more than one of them is present, the last one is used. |
more than one of them is present, the last one is used. They can be |
3225 |
|
combined with a change of newline convention, for example, a pattern |
3226 |
|
can start with: |
3227 |
|
|
3228 |
|
(*ANY)(*BSR_ANYCRLF) |
3229 |
|
|
3230 |
Inside a character class, \R matches the letter "R". |
Inside a character class, \R matches the letter "R". |
3231 |
|
|
4856 |
|
|
4857 |
REVISION |
REVISION |
4858 |
|
|
4859 |
Last updated: 11 September 2007 |
Last updated: 14 September 2007 |
4860 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
4861 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4862 |
|
|