--- code/branches/pcre16/doc/html/pcrepattern.html 2011/12/12 12:15:17 800 +++ code/branches/pcre16/doc/html/pcrepattern.html 2011/12/12 16:23:37 801 @@ -268,7 +268,8 @@ \t tab (hex 09) \ddd character with octal code ddd, or back reference \xhh character with hex code hh - \x{hhh..} character with hex code hhh.. + \x{hhh..} character with hex code hhh.. (non-JavaScript mode) + \uhhhh character with hex code hhhh (JavaScript mode only) The precise effect of \cx is as follows: if x is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. @@ -280,12 +281,12 @@ 0xc0 bits are flipped.)
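The \cx rule quoted in the context above (fold a lower case letter to upper case, then invert bit 6, hex 40) can be sketched in a few lines of Python. This is only an illustration that assumes ASCII input; the name `control_escape` is hypothetical and is not part of PCRE.

```python
def control_escape(x):
    """Return the character denoted by \\cx under the rule above.

    Hypothetical helper for illustration; assumes x is one ASCII character.
    """
    code = ord(x.upper())    # a lower case letter is converted to upper case
    return chr(code ^ 0x40)  # bit 6 of the character (hex 40) is inverted


# \cz gives hex 1A, \c{ gives hex 3B (";"), and \c; gives hex 7B ("{").
results = [control_escape(c) for c in "z{;"]
```

Because the operation is a single XOR, applying it to the result of a non-letter input returns the original character.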

-After \x, from zero to two hexadecimal digits are read (letters can be in -upper or lower case). Any number of hexadecimal digits may appear between \x{ -and }, but the value of the character code must be less than 256 in non-UTF-8 -mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in -hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code -point, which is 10FFFF. +By default, after \x, from zero to two hexadecimal digits are read (letters +can be in upper or lower case). Any number of hexadecimal digits may appear +between \x{ and }, but the value of the character code must be less than 256 +in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum +value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest +Unicode code point, which is 10FFFF.

If characters other than hexadecimal digits appear between \x{ and }, or if @@ -294,9 +295,17 @@ following digits, giving a character whose value is zero.

+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is +as just described only when it is followed by two hexadecimal digits. +Otherwise, it matches a literal "x" character. In JavaScript mode, support for +code points greater than 255 is provided by \u, which must be followed by +four hexadecimal digits; otherwise it matches a literal "u" character. +
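The JavaScript-mode rule added in this hunk can be sketched as a small decoder. The Python below is a minimal illustration of the rule as worded in the paragraph above, not PCRE's implementation; `decode_js_escape` is a hypothetical name, and the function assumes it is only called at a backslash that is followed by "x" or "u".

```python
import string

HEX_DIGITS = set(string.hexdigits)


def decode_js_escape(text, pos):
    """Decode a \\x or \\u escape whose backslash is at text[pos].

    Returns (character, next_position). Hypothetical helper mirroring the
    JavaScript-mode rule described above; it is not PCRE code.
    """
    kind = text[pos + 1]
    width = {"x": 2, "u": 4}[kind]  # hex digits required after \x or \u
    digits = text[pos + 2 : pos + 2 + width]
    if len(digits) == width and all(d in HEX_DIGITS for d in digits):
        return chr(int(digits, 16)), pos + 2 + width
    # Too few hex digits: the escape is a literal "x" or "u".
    return kind, pos + 2
```

Under this sketch, `\xdc` and `\u00dc` decode to the same character, while `\x{dc}` yields a literal "x" because "{" is not a hexadecimal digit.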

+

Characters whose value is less than 256 can be defined by either of the two -syntaxes for \x. There is no difference in the way they are handled. For -example, \xdc is exactly the same as \x{dc}. +syntaxes for \x (or by \u in JavaScript mode). There is no difference in the +way they are handled. For example, \xdc is exactly the same as \x{dc} (or +\u00dc in JavaScript mode).

After \0 up to two further octal digits are read. If there are fewer than two @@ -338,12 +347,25 @@

All the sequences that define a single character value can be used both inside -and outside character classes. In addition, inside a character class, the -sequence \b is interpreted as the backspace character (hex 08). The sequences -\B, \N, \R, and \X are not special inside a character class. Like any other -unrecognized escape sequences, they are treated as the literal characters "B", -"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is -set. Outside a character class, these sequences have different meanings. +and outside character classes. In addition, inside a character class, \b is +interpreted as the backspace character (hex 08). +

+

+\N is not allowed in a character class. \B, \R, and \X are not special +inside a character class. Like other unrecognized escape sequences, they are +treated as the literal characters "B", "R", and "X" by default, but cause an +error if the PCRE_EXTRA option is set. Outside a character class, these +sequences have different meanings. +

+
+Unsupported escape sequences +
+

+In Perl, the sequences \l, \L, \u, and \U are recognized by its string +handler and used to modify the case of following characters. By default, PCRE +does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT +option is set, \U matches a "U" character, and \u can be used to define a +character by code point, as described in the previous section.


Absolute and relative back references @@ -389,7 +411,8 @@ There is also the single sequence \N, which matches a non-newline character. This is the same as the "." metacharacter -when PCRE_DOTALL is not set. +when PCRE_DOTALL is not set. Perl also uses \N to match characters by name; +PCRE does not support this.

Each pair of lower and upper case escape sequences partitions the complete set @@ -963,7 +986,8 @@

The escape sequence \N behaves like a dot, except that it is not affected by the PCRE_DOTALL option. In other words, it matches any character except one -that signifies the end of a line. +that signifies the end of a line. Perl also uses \N to match characters by +name; PCRE does not support this.


MATCHING A SINGLE BYTE

@@ -979,8 +1003,8 @@

PCRE does not allow \C to appear in lookbehind assertions -(described below), -because in UTF-8 mode this would make it impossible to calculate the length of +(described below) +in UTF-8 mode, because this would make it impossible to calculate the length of the lookbehind.

@@ -1926,10 +1950,10 @@ assertion fails.

-PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode) -to appear in lookbehind assertions, because it makes it impossible to calculate -the length of the lookbehind. The \X and \R escapes, which can match -different numbers of bytes, are also not permitted. +In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, +even in UTF-8 mode) to appear in lookbehind assertions, because it makes it +impossible to calculate the length of the lookbehind. The \X and \R escapes, +which can match different numbers of bytes, are also not permitted.

"Subroutine" @@ -2511,10 +2535,11 @@ If any of these verbs are used in an assertion or in a subpattern that is called as a subroutine (whether or not recursively), their effect is confined to that subpattern; it does not extend to the surrounding pattern, with one -exception: a *MARK that is encountered in a positive assertion is passed -back (compare capturing parentheses in assertions). Note that such subpatterns -are processed as anchored at the point where they are tested. Note also that -Perl's treatment of subroutines is different in some cases. +exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in +a successful positive assertion is passed back when a match succeeds +(compare capturing parentheses in assertions). Note that such subpatterns are +processed as anchored at the point where they are tested. Note also that Perl's +treatment of subroutines is different in some cases.

The new verbs make use of what was previously invalid syntax: an opening @@ -2536,6 +2561,10 @@ when calling pcre_compile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).

+

+Experiments with Perl suggest that it too has similar optimizations, sometimes +leading to anomalous results. +


Verbs that act immediately
@@ -2583,17 +2612,17 @@ (*MARK) as you like in a pattern, and their names do not have to be unique.

-When a match succeeds, the name of the last-encountered (*MARK) is passed back -to the caller via the pcre_extra data structure, as described in the +When a match succeeds, the name of the last-encountered (*MARK) on the matching +path is passed back to the caller via the pcre_extra data structure, as +described in the section on pcre_extra in the pcreapi -documentation. No data is returned for a partial match. Here is an example of -pcretest output, where the /K modifier requests the retrieval and -outputting of (*MARK) data: +documentation. Here is an example of pcretest output, where the /K +modifier requests the retrieval and outputting of (*MARK) data:

-  /X(*MARK:A)Y|X(*MARK:B)Z/K
-  XY
+    re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+  data> XY
    0: XY
   MK: A
   XZ
@@ -2611,32 +2640,17 @@
 assertions.
 

-A name may also be returned after a failed match if the final path through the -pattern involves (*MARK). However, unless (*MARK) used in conjunction with -(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the -starting point for matching is advanced, the final check is often with an empty -string, causing a failure before (*MARK) is reached. For example: +After a partial match or a failed match, the name of the last encountered +(*MARK) in the entire match process is returned. For example:

-  /X(*MARK:A)Y|X(*MARK:B)Z/K
-  XP
-  No match
-
-There are three potential starting points for this match (starting with X, -starting with P, and with an empty string). If the pattern is anchored, the -result is different: -
-  /^X(*MARK:A)Y|^X(*MARK:B)Z/K
-  XP
+    re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+  data> XP
   No match, mark = B
 
-PCRE's start-of-match optimizations can also interfere with this. For example, -if, as a result of a call to pcre_study(), it knows the minimum -subject length for a match, a shorter subject will not be scanned at all. -

-

-Note that similar anomalies (though different in detail) exist in Perl, no -doubt for the same reasons. The use of (*MARK) data after a failed match of an -unanchored pattern is not recommended, unless (*COMMIT) is involved. +Note that in this unanchored example the mark is retained from the match +attempt that started at the letter "X". Subsequent match attempts starting at +"P" and then with an empty string do not get as far as the (*MARK) item, but +nevertheless do not reset it.


Verbs that act after backtracking @@ -2675,8 +2689,8 @@ unless PCRE's start-of-match optimizations are turned off, as shown in this pcretest example:
-  /(*COMMIT)abc/
-  xyzabc
+    re> /(*COMMIT)abc/
+  data> xyzabc
    0: abc
   xyzabc\Y
   No match
@@ -2697,10 +2711,8 @@
 the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
 (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
 but there are some uses of (*PRUNE) that cannot be expressed in any other way.
-The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
-match fails completely; the name is passed back if this is the final attempt.
-(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
-pattern (*PRUNE) has the same effect as (*COMMIT).
+The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
+anchored pattern (*PRUNE) has the same effect as (*COMMIT).
 
   (*SKIP)
 
@@ -2726,8 +2738,7 @@ searched for the most recent (*MARK) that has the same name. If one is found, the "bumpalong" advance is to the subject position that corresponds to that (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a -matching name is found, normal "bumpalong" of one character happens (that is, -the (*SKIP) is ignored). +matching name is found, the (*SKIP) is ignored.
   (*THEN) or (*THEN:NAME)
 
@@ -2741,9 +2752,8 @@ If the COND1 pattern matches, FOO is tried (and possibly further items after the end of the group if FOO succeeds); on failure, the matcher skips to the second alternative and tries COND2, without backtracking into COND1. The -behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the -overall match fails. If (*THEN) is not inside an alternation, it acts like -(*PRUNE). +behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). +If (*THEN) is not inside an alternation, it acts like (*PRUNE).

Note that a subpattern that does not contain a | character is just a part of @@ -2819,7 +2829,7 @@


REVISION

-Last updated: 19 October 2011 +Last updated: 29 November 2011
Copyright © 1997-2011 University of Cambridge.