--- code/trunk/doc/html/pcrepattern.html 2007/06/13 14:55:18 181
+++ code/trunk/doc/html/pcrepattern.html 2007/06/13 15:09:54 182
@@ -24,19 +24,20 @@
  • VERTICAL BAR
  • INTERNAL OPTION SETTING
  • SUBPATTERNS
- • NAMED SUBPATTERNS
- • REPETITION
- • ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
- • BACK REFERENCES
- • ASSERTIONS
- • CONDITIONAL SUBPATTERNS
- • COMMENTS
- • RECURSIVE PATTERNS
- • SUBPATTERNS AS SUBROUTINES
- • CALLOUTS
- • SEE ALSO
- • AUTHOR
- • REVISION
+ • DUPLICATE SUBPATTERN NUMBERS
+ • NAMED SUBPATTERNS
+ • REPETITION
+ • ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
+ • BACK REFERENCES
+ • ASSERTIONS
+ • CONDITIONAL SUBPATTERNS
+ • COMMENTS
+ • RECURSIVE PATTERNS
+ • SUBPATTERNS AS SUBROUTINES
+ • CALLOUTS
+ • SEE ALSO
+ • AUTHOR
+ • REVISION
    PCRE REGULAR EXPRESSION DETAILS

    @@ -270,8 +271,12 @@

       \d     any decimal digit
       \D     any character that is not a decimal digit
    +  \h     any horizontal whitespace character
    +  \H     any character that is not a horizontal whitespace character
       \s     any whitespace character
       \S     any character that is not a whitespace character
    +  \v     any vertical whitespace character
    +  \V     any character that is not a vertical whitespace character
       \w     any "word" character
       \W     any "non-word" character
     
    @@ -287,9 +292,52 @@

    For compatibility with Perl, \s does not match the VT character (code 11). This makes it different from the POSIX "space" class. The \s characters
    -are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is
    +are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is
    included in a Perl script, \s may match the VT character. In PCRE, it never
    -does.)
    +does.
    +
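    The five-character set described above can be written out as an explicit character class. The following is a minimal Python sketch, not PCRE code; the names PCRE_SPACE and is_pcre_space are hypothetical and exist only for this illustration. Note that Python's own \s does match VT, so it behaves like the POSIX "space" class rather than like PCRE here.

```python
import re

# The five characters the text lists for \s: HT (9), LF (10), FF (12),
# CR (13), and space (32). VT (11) is deliberately left out.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")

def is_pcre_space(ch):
    """Return True if the single character ch is one of the five listed characters."""
    return PCRE_SPACE.fullmatch(ch) is not None
```

    Under this sketch, is_pcre_space("\t") is true and is_pcre_space("\x0b") is false, whereas Python's built-in \s accepts "\x0b".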

    +

    +In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
    +\w, and always match \D, \S, and \W. This is true even when Unicode
    +character property support is available. These sequences retain their original
    +meanings from before UTF-8 support was available, mainly for efficiency
    +reasons.
    +

    +

    +The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the
    +other sequences, these do match certain high-valued codepoints in UTF-8 mode.
    +The horizontal space characters are:
    +

    +  U+0009     Horizontal tab
    +  U+0020     Space
    +  U+00A0     Non-break space
    +  U+1680     Ogham space mark
    +  U+180E     Mongolian vowel separator
    +  U+2000     En quad
    +  U+2001     Em quad
    +  U+2002     En space
    +  U+2003     Em space
    +  U+2004     Three-per-em space
    +  U+2005     Four-per-em space
    +  U+2006     Six-per-em space
    +  U+2007     Figure space
    +  U+2008     Punctuation space
    +  U+2009     Thin space
    +  U+200A     Hair space
    +  U+202F     Narrow no-break space
    +  U+205F     Medium mathematical space
    +  U+3000     Ideographic space
    +
    +The vertical space characters are:
    +
    +  U+000A     Linefeed
    +  U+000B     Vertical tab
    +  U+000C     Formfeed
    +  U+000D     Carriage return
    +  U+0085     Next line
    +  U+2028     Line separator
    +  U+2029     Paragraph separator
    +
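    For engines that lack \h and \v, the two lists above can be turned into explicit character classes. The following is a rough Python sketch under that assumption (Python's re module does not provide \h or \H, and its \v means only the VT character); the helper char_class and the names H, NOT_H, V, and NOT_V are hypothetical stand-ins for \h, \H, \v, and \V.

```python
import re

# Code points copied from the two lists above.
HORIZONTAL = [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
              *range(0x2000, 0x200B),   # U+2000 .. U+200A inclusive
              0x202F, 0x205F, 0x3000]
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]

def char_class(code_points, negate=False):
    """Build a regex character class from code points, using \\uXXXX escapes."""
    body = "".join("\\u%04X" % cp for cp in code_points)
    return ("[^" if negate else "[") + body + "]"

H = re.compile(char_class(HORIZONTAL))                   # stand-in for \h
NOT_H = re.compile(char_class(HORIZONTAL, negate=True))  # stand-in for \H
V = re.compile(char_class(VERTICAL))                     # stand-in for \v
NOT_V = re.compile(char_class(VERTICAL, negate=True))    # stand-in for \V
```

    With these definitions, H matches a non-break space (U+00A0) but not a linefeed, and V matches a line separator (U+2028) but not an ordinary space.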

    A "word" character is an underscore or any character less than 256 that is a
    @@ -301,20 +349,15 @@
    pcreapi page). For example, in a French locale such as "fr_FR" in Unix-like systems, or "french" in Windows, some character codes greater than 128 are used for
    -accented letters, and these are matched by \w.
    -
    -
    -In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
    -\w, and always match \D, \S, and \W. This is true even when Unicode
    -character property support is available. The use of locales with Unicode is
    -discouraged.
    +accented letters, and these are matched by \w. The use of locales with Unicode
    +is discouraged.


    Newline sequences

    Outside a character class, the escape sequence \R matches any Unicode newline
    -sequence. This is an extension to Perl. In non-UTF-8 mode \R is equivalent to
    +sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is equivalent to
    the following:

       (?>\r\n|\n|\x0b|\f|\r|\x85)
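    The same alternation can be tried in an engine without \R. The sketch below uses Python's re module and is only an approximation: it drops the atomic group (older versions of re do not support the (?> syntax), which is harmless for splitting but is not identical to the PCRE behaviour. The names NEWLINE and split_lines are hypothetical.

```python
import re

# Non-atomic stand-in for (?>\r\n|\n|\x0b|\f|\r|\x85). "\r\n" is listed
# first so that a CR LF pair is consumed as a single newline.
NEWLINE = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85")

def split_lines(text):
    """Split text at each newline sequence in the non-UTF-8 \\R set."""
    return NEWLINE.split(text)
```

    For example, split_lines("a\r\nb\x85c\nd") yields four items, with the CR LF pair treated as one separator rather than two.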
    @@ -966,7 +1009,38 @@
     is reached, an option setting in one branch does affect subsequent branches, so
     the above patterns match "SUNDAY" as well as "Saturday".
     

    -
    NAMED SUBPATTERNS
    +
    DUPLICATE SUBPATTERN NUMBERS
    +

    +Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
    +the same numbers for its capturing parentheses. Such a subpattern starts with
    +(?| and is itself a non-capturing subpattern. For example, consider this
    +pattern:
    +

    +  (?|(Sat)ur|(Sun))day
    +
    +Because the two alternatives are inside a (?| group, both sets of capturing
    +parentheses are numbered one. Thus, when the pattern matches, you can look
    +at captured substring number one, whichever alternative matched. This construct
    +is useful when you want to capture part, but not all, of one of a number of
    +alternatives. Inside a (?| group, parentheses are numbered as usual, but the
    +number is reset at the start of each branch. The numbers of any capturing
    +buffers that follow the subpattern start after the highest number used in any
    +branch. The following example is taken from the Perl documentation.
    +The numbers underneath show in which buffer the captured content will be
    +stored.
    +
    +  # before  ---------------branch-reset----------- after
    +  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
    +  # 1            2         2  3        2     3     4
    +
    +A backreference or a recursive call to a numbered subpattern always refers to
    +the first one in the pattern with the given number.
    +
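    The numbering rule for a (?| group can be worked through as simple arithmetic. The Python sketch below is illustrative only: branch_reset_numbers is a hypothetical helper, not part of PCRE or Perl, and it models just the numbering, not the matching. The second part shows, for contrast, what happens in an engine without (?| (Python's re is one): the two groups get different numbers, so the caller has to check both.

```python
import re

def branch_reset_numbers(first_free, captures_per_branch):
    """Model the numbering inside a (?| group.

    first_free is the next unused group number before the group;
    captures_per_branch holds the number of capturing parentheses in
    each branch. Returns the numbers used in each branch and the first
    number available after the group.
    """
    branches = [list(range(first_free, first_free + n))
                for n in captures_per_branch]
    next_free = first_free + max(captures_per_branch, default=0)
    return branches, next_free

# The Perl example: ( a ) takes number 1, then the branches contain
# 1, 2, and 2 capturing groups respectively.
numbers, after = branch_reset_numbers(2, [1, 2, 2])

# Without branch reset, (Sat) is group 1 and (Sun) is group 2.
m = re.fullmatch(r"(?:(Sat)ur|(Sun))day", "Sunday")
day = m.group(1) or m.group(2)
```

    Here numbers comes out as [[2], [2, 3], [2, 3]] and after as 4, which agrees with the row of numbers under the Perl example.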

    +

    +An alternative approach to using this "branch reset" feature is to use
    +duplicate named subpatterns, as described in the next section.
    +

    +
    NAMED SUBPATTERNS

    Identifying capturing parentheses by number is simple, but it can be very hard to keep track of the numbers in complicated regular expressions. Furthermore,
    @@ -1008,6 +1082,10 @@
    (?<DN>Sat)(?:urday)?

    There are five capturing substrings, but only one is ever set after a match.
    +(An alternative way of solving this problem is to use a "branch reset"
    +subpattern, as described in the previous section.)
    +

    +

    The convenience function for extracting the data by name returns the substring for the first (and in this example, the only) subpattern of that name that matched. This saves searching to find which numbered subpattern it was. If you
    @@ -1017,7 +1095,7 @@
    pcreapi documentation.

    -
    REPETITION
    +
    REPETITION

    Repetition is specified by quantifiers, which can follow any of the following items:
    @@ -1168,7 +1246,7 @@
    matches "aba" the value of the second captured substring is "b".

    -
    ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
    +
    ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

    With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") repetition, failure of what follows normally causes the repeated item to be
    @@ -1267,7 +1345,7 @@
    sequences of non-digits cannot be broken, and failure happens quickly.

    -
    BACK REFERENCES
    +
    BACK REFERENCES

    Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier
    @@ -1380,7 +1458,7 @@
    done using alternation, as in the example above, or by a quantifier with a minimum of zero.

    -
    ASSERTIONS
    +
    ASSERTIONS

    An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple
    @@ -1540,7 +1618,7 @@
    is another pattern that matches "foo" preceded by three digits and any three characters that are not "999".

    -
    CONDITIONAL SUBPATTERNS
    +
    CONDITIONAL SUBPATTERNS

    It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on
    @@ -1678,7 +1756,7 @@
    against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.

    -
    COMMENTS
    +
    COMMENTS

    The sequence (?# marks the start of a comment that continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters
    @@ -1689,7 +1767,7 @@
    character class introduces a comment that continues to immediately after the next newline in the pattern.

    -
    RECURSIVE PATTERNS
    +
    RECURSIVE PATTERNS

    Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can
    @@ -1819,7 +1897,7 @@
    different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call.

    -
    SUBPATTERNS AS SUBROUTINES
    +
    SUBPATTERNS AS SUBROUTINES

    If the syntax for a recursive subpattern reference (either by number or by name) is used outside the parentheses to which it refers, it operates like a
    @@ -1859,7 +1937,7 @@
    It matches "abcabc". It does not match "abcABC" because the change of processing option does not affect the called subpattern.

    -
    CALLOUTS
    +
    CALLOUTS

    Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl code to be obeyed in the middle of matching a regular expression. This makes it
    @@ -1894,11 +1972,11 @@
    pcrecallout documentation.

    -
    SEE ALSO
    +
    SEE ALSO

    pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).

    -
    AUTHOR
    +
    AUTHOR

    Philip Hazel
    @@ -1907,9 +1985,9 @@
    Cambridge CB2 3QH, England.

    -
    REVISION
    +
    REVISION

    -Last updated: 29 May 2007
    +Last updated: 13 June 2007
    Copyright © 1997-2007 University of Cambridge.