262 |
.sp |
.sp |
263 |
\ed any decimal digit |
\ed any decimal digit |
264 |
\eD any character that is not a decimal digit |
\eD any character that is not a decimal digit |
265 |
|
\eh any horizontal whitespace character |
266 |
|
\eH any character that is not a horizontal whitespace character |
267 |
\es any whitespace character |
\es any whitespace character |
268 |
\eS any character that is not a whitespace character |
\eS any character that is not a whitespace character |
269 |
|
\ev any vertical whitespace character |
270 |
|
\eV any character that is not a vertical whitespace character |
271 |
\ew any "word" character |
\ew any "word" character |
272 |
\eW any "non-word" character |
\eW any "non-word" character |
273 |
.sp |
.sp |
281 |
.P |
.P |
282 |
For compatibility with Perl, \es does not match the VT character (code 11). |
For compatibility with Perl, \es does not match the VT character (code 11). |
283 |
This makes it different from the POSIX "space" class. The \es characters |
This makes it different from the POSIX "space" class. The \es characters |
284 |
are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is |
285 |
included in a Perl script, \es may match the VT character. In PCRE, it never |
included in a Perl script, \es may match the VT character. In PCRE, it never |
286 |
does.) |
does. |
287 |
|
.P |
288 |
|
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or |
289 |
|
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode |
290 |
|
character property support is available. These sequences retain their original |
291 |
|
meanings from before UTF-8 support was available, mainly for efficiency |
292 |
|
reasons. |
293 |
|
.P |
294 |
|
The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the |
295 |
|
other sequences, these do match certain high-valued codepoints in UTF-8 mode. |
296 |
|
The horizontal space characters are: |
297 |
|
.sp |
298 |
|
U+0009 Horizontal tab |
299 |
|
U+0020 Space |
300 |
|
U+00A0 Non-break space |
301 |
|
U+1680 Ogham space mark |
302 |
|
U+180E Mongolian vowel separator |
303 |
|
U+2000 En quad |
304 |
|
U+2001 Em quad |
305 |
|
U+2002 En space |
306 |
|
U+2003 Em space |
307 |
|
U+2004 Three-per-em space |
308 |
|
U+2005 Four-per-em space |
309 |
|
U+2006 Six-per-em space |
310 |
|
U+2007 Figure space |
311 |
|
U+2008 Punctuation space |
312 |
|
U+2009 Thin space |
313 |
|
U+200A Hair space |
314 |
|
U+202F Narrow no-break space |
315 |
|
U+205F Medium mathematical space |
316 |
|
U+3000 Ideographic space |
317 |
|
.sp |
318 |
|
The vertical space characters are: |
319 |
|
.sp |
320 |
|
U+000A Linefeed |
321 |
|
U+000B Vertical tab |
322 |
|
U+000C Formfeed |
323 |
|
U+000D Carriage return |
324 |
|
U+0085 Next line |
325 |
|
U+2028 Line separator |
326 |
|
U+2029 Paragraph separator |
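The two lists above can be written as lookup tables. The following is a minimal Python sketch, not PCRE's own implementation. The names HORIZONTAL_SPACE, VERTICAL_SPACE, is_hspace, and is_vspace are hypothetical and chosen only for illustration. The code points are copied from the lists above.

```python
# Code points listed above for \h (horizontal whitespace).
HORIZONTAL_SPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 (en quad) through U+200A (hair space)
    + [0x202F, 0x205F, 0x3000]
)

# Code points listed above for \v (vertical whitespace).
VERTICAL_SPACE = frozenset(
    [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]
)


def is_hspace(ch):
    """Return True if the single character ch is in the \\h list."""
    return ord(ch) in HORIZONTAL_SPACE


def is_vspace(ch):
    """Return True if the single character ch is in the \\v list."""
    return ord(ch) in VERTICAL_SPACE
```

The two sets do not overlap, so \eh and \ev never match the same character. Note that VT (U+000B) is in the vertical list even though \es does not match it.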
327 |
.P |
.P |
328 |
A "word" character is an underscore or any character less than 256 that is a |
A "word" character is an underscore or any character less than 256 that is a |
329 |
letter or digit. The definition of letters and digits is controlled by PCRE's |
letter or digit. The definition of letters and digits is controlled by PCRE's |
339 |
.\" |
.\" |
340 |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
341 |
or "french" in Windows, some character codes greater than 128 are used for |
or "french" in Windows, some character codes greater than 128 are used for |
342 |
accented letters, and these are matched by \ew. |
accented letters, and these are matched by \ew. The use of locales with Unicode |
343 |
.P |
is discouraged. |
|
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or |
|
|
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode |
|
|
character property support is available. The use of locales with Unicode is |
|
|
discouraged. |
|
344 |
. |
. |
345 |
. |
. |
346 |
.SS "Newline sequences" |
.SS "Newline sequences" |
347 |
.rs |
.rs |
348 |
.sp |
.sp |
349 |
Outside a character class, the escape sequence \eR matches any Unicode newline |
Outside a character class, the escape sequence \eR matches any Unicode newline |
350 |
sequence. This is an extension to Perl. In non-UTF-8 mode \eR is equivalent to |
sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is equivalent to |
351 |
the following: |
the following: |
352 |
.sp |
.sp |
353 |
(?>\er\en|\en|\ex0b|\ef|\er|\ex85) |
(?>\er\en|\en|\ex0b|\ef|\er|\ex85) |
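As a rough illustration of the non-UTF-8 equivalence above, the sketch below uses Python's re module, which is not PCRE and has no \eR escape. It spells out the same alternatives, with the two-character CR LF sequence tried first so that it counts as one newline. The atomic group is left out because it makes no difference when the alternation is the whole pattern. The names NEWLINE_SEQUENCE and split_lines are hypothetical.

```python
import re

# Same alternatives as the pattern above; \r\n comes first so that a
# CR LF pair is consumed as a single newline sequence.
NEWLINE_SEQUENCE = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85")


def split_lines(text):
    """Split text at every newline sequence that the pattern recognizes."""
    return NEWLINE_SEQUENCE.split(text)
```

For instance, split_lines("a\r\nb") yields two fields rather than three, because CR LF is treated as one separator.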
1001 |
.SH "DUPLICATE SUBPATTERN NUMBERS" |
.SH "DUPLICATE SUBPATTERN NUMBERS" |
1002 |
.rs |
.rs |
1003 |
.sp |
.sp |
1004 |
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
1005 |
the same numbers for its capturing parentheses. Such a subpattern starts with |
the same numbers for its capturing parentheses. Such a subpattern starts with |
1006 |
(?| and is itself a non-capturing subpattern. For example, consider this |
(?| and is itself a non-capturing subpattern. For example, consider this |
1007 |
pattern: |
pattern: |
1008 |
.sp |
.sp |
1009 |
(?|(Sat)ur|(Sun))day |
(?|(Sat)ur|(Sun))day |
1010 |
.sp |
.sp |
1011 |
Because the two alternatives are inside a (?| group, both sets of capturing |
Because the two alternatives are inside a (?| group, both sets of capturing |
1012 |
parentheses are numbered one. Thus, when the pattern matches, you can look |
parentheses are numbered one. Thus, when the pattern matches, you can look |
1013 |
at captured substring number one, whichever alternative matched. This construct |
at captured substring number one, whichever alternative matched. This construct |
1014 |
is useful when you want to capture part, but not all, of one of a number of |
is useful when you want to capture part, but not all, of one of a number of |
1015 |
alternatives. Inside a (?| group, parentheses are numbered as usual, but the |
alternatives. Inside a (?| group, parentheses are numbered as usual, but the |
1016 |
number is reset at the start of each branch. The numbers of any capturing |
number is reset at the start of each branch. The numbers of any capturing |
1017 |
buffers that follow the subpattern start after the highest number used in any |
buffers that follow the subpattern start after the highest number used in any |
1018 |
branch. The following example is taken from the Perl documentation. |
branch. The following example is taken from the Perl documentation. |
1019 |
The numbers underneath show in which buffer the captured content will be |
The numbers underneath show in which buffer the captured content will be |
1020 |
stored. |
stored. |
1021 |
.sp |
.sp |
1022 |
# before ---------------branch-reset----------- after |
# before ---------------branch-reset----------- after |
1023 |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
1024 |
# 1 2 2 3 2 3 4 |
# 1 2 2 3 2 3 4 |
1025 |
.sp |
.sp |
1026 |
A backreference or a recursive call to a numbered subpattern always refers to |
A backreference or a recursive call to a numbered subpattern always refers to |
1027 |
the first one in the pattern with the given number. |
the first one in the pattern with the given number. |
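The numbering rule described above (reset the number at the start of each branch, then continue after the highest number used in any branch) can be sketched as a small Python function. This is an illustration of the rule only, not PCRE's compiler. The name group_numbers is hypothetical, and the sketch assumes a simplified pattern syntax: every group opening with "(?" is treated as non-capturing, so named groups are not counted, and character classes are not handled.

```python
def group_numbers(pattern):
    """Return the capture numbers of the capturing parentheses, in order."""
    numbers = []
    counter = 0   # last capture number handed out
    stack = []    # None for ordinary groups, [start, highest] for (?| groups
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2            # skip the escaped character
            continue
        if ch == "(":
            if pattern.startswith("(?|", i):
                # Branch reset group: remember the number in force on entry.
                stack.append([counter, counter])
                i += 3
                continue
            if pattern.startswith("(?", i):
                stack.append(None)    # assumed non-capturing in this sketch
            else:
                counter += 1
                numbers.append(counter)
                stack.append(None)
        elif ch == "|" and stack and stack[-1] is not None:
            # New branch directly inside a (?| group: reset the numbering.
            frame = stack[-1]
            frame[1] = max(frame[1], counter)
            counter = frame[0]
        elif ch == ")" and stack:
            frame = stack.pop()
            if frame is not None:
                # Leaving the (?| group: continue after the highest number.
                counter = max(frame[1], counter)
        i += 1
    return numbers
```

Applied to the Perl example above with the spaces removed, the function returns the same sequence as the numbers shown underneath it, and for (?|(Sat)ur|(Sun))day both sets of parentheses come out as number one.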
1028 |
.P |
.P |
1079 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
1080 |
.sp |
.sp |
1081 |
There are five capturing substrings, but only one is ever set after a match. |
There are five capturing substrings, but only one is ever set after a match. |
1082 |
(An alternative way of solving this problem is to use a "branch reset" |
(An alternative way of solving this problem is to use a "branch reset" |
1083 |
subpattern, as described in the previous section.) |
subpattern, as described in the previous section.) |
1084 |
.P |
.P |
1085 |
The convenience function for extracting the data by name returns the substring |
The convenience function for extracting the data by name returns the substring |
1973 |
.rs |
.rs |
1974 |
.sp |
.sp |
1975 |
.nf |
.nf |
1976 |
Last updated: 11 June 2007 |
Last updated: 13 June 2007 |
1977 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
1978 |
.fi |
.fi |