--- code/trunk/doc/html/pcrepattern.html 2007/06/13 14:55:18 181
+++ code/trunk/doc/html/pcrepattern.html 2007/06/13 15:09:54 182
@@ -24,19 +24,20 @@
@@ -270,8 +271,12 @@
   \d     any decimal digit
   \D     any character that is not a decimal digit
+  \h     any horizontal whitespace character
+  \H     any character that is not a horizontal whitespace character
   \s     any whitespace character
   \S     any character that is not a whitespace character
+  \v     any vertical whitespace character
+  \V     any character that is not a vertical whitespace character
   \w     any "word" character
   \W     any "non-word" character
@@ -287,9 +292,52 @@
For compatibility with Perl, \s does not match the VT character (code 11). This makes it different from the POSIX "space" class. The \s characters -are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is +are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is included in a Perl script, \s may match the VT character. In PCRE, it never -does.) +does. +
++In UTF-8 mode, characters with values greater than 128 never match \d, \s, or +\w, and always match \D, \S, and \W. This is true even when Unicode +character property support is available. These sequences retain their original +meanings from before UTF-8 support was available, mainly for efficiency +reasons. +
++The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the +other sequences, these do match certain high-valued codepoints in UTF-8 mode. +The horizontal space characters are: +
+  U+0009     Horizontal tab
+  U+0020     Space
+  U+00A0     Non-break space
+  U+1680     Ogham space mark
+  U+180E     Mongolian vowel separator
+  U+2000     En quad
+  U+2001     Em quad
+  U+2002     En space
+  U+2003     Em space
+  U+2004     Three-per-em space
+  U+2005     Four-per-em space
+  U+2006     Six-per-em space
+  U+2007     Figure space
+  U+2008     Punctuation space
+  U+2009     Thin space
+  U+200A     Hair space
+  U+202F     Narrow no-break space
+  U+205F     Medium mathematical space
+  U+3000     Ideographic space
+
+The vertical space characters are:
+
+  U+000A     Linefeed
+  U+000B     Vertical tab
+  U+000C     Formfeed
+  U+000D     Carriage return
+  U+0085     Next line
+  U+2028     Line separator
+  U+2029     Paragraph separator
+
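As a rough illustration of the two lists above, the following Python sketch spells out \h, \H, \v, and \V as explicit character classes. Python's standard `re` module is only a stand-in here: it has no \h escape, and its \v means just U+000B. The names `HORIZONTAL`, `VERTICAL`, `H`, `NOT_H`, `V`, and `NOT_V` are hypothetical, and this is not a description of how PCRE implements these escapes internally.

```python
import re

# Code points listed for \h (horizontal whitespace).
# "\u2000-\u200A" becomes a range once placed inside [...].
HORIZONTAL = (
    "\u0009\u0020\u00A0\u1680\u180E"
    "\u2000-\u200A"
    "\u202F\u205F\u3000"
)

# Code points listed for \v (vertical whitespace).
# "\u000A-\u000D" covers LF, VT, FF, and CR.
VERTICAL = "\u000A-\u000D\u0085\u2028\u2029"

# Hypothetical stand-ins for \h, \H, \v, and \V.
H = re.compile("[" + HORIZONTAL + "]")
NOT_H = re.compile("[^" + HORIZONTAL + "]")
V = re.compile("[" + VERTICAL + "]")
NOT_V = re.compile("[^" + VERTICAL + "]")


def classify(ch):
    """Return 'h', 'v', or 'other' for a single character."""
    if H.fullmatch(ch):
        return "h"
    if V.fullmatch(ch):
        return "v"
    return "other"
```

For instance, `classify("\u3000")` falls in the horizontal set and `classify("\u2028")` in the vertical set, while an ordinary letter is in neither.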
A "word" character is an underscore or any character less than 256 that is a @@ -301,20 +349,15 @@ pcreapi page). For example, in a French locale such as "fr_FR" in Unix-like systems, or "french" in Windows, some character codes greater than 128 are used for -accented letters, and these are matched by \w. -
--In UTF-8 mode, characters with values greater than 128 never match \d, \s, or -\w, and always match \D, \S, and \W. This is true even when Unicode -character property support is available. The use of locales with Unicode is -discouraged. +accented letters, and these are matched by \w. The use of locales with Unicode +is discouraged.
Outside a character class, the escape sequence \R matches any Unicode newline -sequence. This is an extension to Perl. In non-UTF-8 mode \R is equivalent to +sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is equivalent to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85) @@ -966,7 +1009,38 @@ is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday". -There are five capturing substrings, but only one is ever set after a match. +(An alternative way of solving this problem is to use a "branch reset" +subpattern, as described in the previous section.) + +
NAMED SUBPATTERNS
+
DUPLICATE SUBPATTERN NUMBERS
++Perl 5.10 introduced a feature whereby each alternative in a subpattern uses +the same numbers for its capturing parentheses. Such a subpattern starts with +(?| and is itself a non-capturing subpattern. For example, consider this +pattern: +
+  (?|(Sat)ur|(Sun))day
+
+Because the two alternatives are inside a (?| group, both sets of capturing
+parentheses are numbered one. Thus, when the pattern matches, you can look
+at captured substring number one, whichever alternative matched. This construct
+is useful when you want to capture part, but not all, of one of a number of
+alternatives. Inside a (?| group, parentheses are numbered as usual, but the
+number is reset at the start of each branch. The numbers of any capturing
+buffers that follow the subpattern start after the highest number used in any
+branch. The following example is taken from the Perl documentation.
+The numbers underneath show in which buffer the captured content will be
+stored.
+
+  # before  ---------------branch-reset----------- after
+  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+  # 1            2         2  3        2     3     4
+
+A backreference or a recursive call to a numbered subpattern always refers to
+the first one in the pattern with the given number.
+
+
+An alternative approach to using this "branch reset" feature is to use
+duplicate named subpatterns, as described in the next section.
+
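To make the benefit of shared group numbers concrete, here is a minimal Python sketch of what a caller has to do without branch reset. It assumes Python's standard `re` module, which to the best of our knowledge does not accept (?|...), so each alternative gets its own group number and the code must find the one that participated in the match. The names `PATTERN` and `day_prefix` are hypothetical.

```python
import re

# Without branch reset, (Sat) is group 1 and (Sun) is group 2.
PATTERN = re.compile(r"(?:(Sat)ur|(Sun))day")


def day_prefix(text):
    """Return the captured prefix, whichever alternative matched."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return None
    # Emulate "look at captured substring number one": pick the
    # single group that took part in the match.
    return next(g for g in m.groups() if g is not None)
```

With a (?| group in PCRE, the search over `m.groups()` is unnecessary, because captured substring number one holds the result for either branch.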
+
NAMED SUBPATTERNS
Identifying capturing parentheses by number is simple, but it can be very hard to keep track of the numbers in complicated regular expressions. Furthermore, @@ -1008,6 +1082,10 @@ (?<DN>Sat)(?:urday)?
The convenience function for extracting the data by name returns the substring for the first (and in this example, the only) subpattern of that name that matched. This saves searching to find which numbered subpattern it was. If you @@ -1017,7 +1095,7 @@ pcreapi documentation.
-Repetition is specified by quantifiers, which can follow any of the following items: @@ -1168,7 +1246,7 @@ matches "aba" the value of the second captured substring is "b".
-With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be @@ -1267,7 +1345,7 @@ sequences of non-digits cannot be broken, and failure happens quickly.
-Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier @@ -1380,7 +1458,7 @@ done using alternation, as in the example above, or by a quantifier with a minimum of zero.
-An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple @@ -1540,7 +1618,7 @@ is another pattern that matches "foo" preceded by three digits and any three characters that are not "999".
-It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on @@ -1678,7 +1756,7 @@ against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
-The sequence (?# marks the start of a comment that continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters @@ -1689,7 +1767,7 @@ character class introduces a comment that continues to immediately after the next newline in the pattern.
-Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can @@ -1819,7 +1897,7 @@ different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call.
-If the syntax for a recursive subpattern reference (either by number or by name) is used outside the parentheses to which it refers, it operates like a @@ -1859,7 +1937,7 @@ It matches "abcabc". It does not match "abcABC" because the change of processing option does not affect the called subpattern.
-Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it @@ -1894,11 +1972,11 @@ pcrecallout documentation.
-pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
-
Philip Hazel
@@ -1907,9 +1985,9 @@
Cambridge CB2 3QH, England.
-Last updated: 29 May 2007
+Last updated: 13 June 2007
Copyright © 1997-2007 University of Cambridge.