Diff of /code/trunk/doc/pcrepattern.3

revision 716 by ph10, Tue Oct 4 16:38:05 2011 UTC
revision 758 by ph10, Mon Nov 21 12:05:36 2011 UTC
# Line 241  one of the following escape sequences th Line 241  one of the following escape sequences th
241    \et        tab (hex 09)    \et        tab (hex 09)
242    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
243    \exhh      character with hex code hh    \exhh      character with hex code hh
244    \ex{hhh..} character with hex code hhh..    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
245      \euhhhh    character with hex code hhhh (JavaScript mode only)
246  .sp  .sp
247  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
248  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
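The \cx rule quoted in this hunk is plain bit arithmetic, so it can be checked outside PCRE. The following is a minimal Python sketch, assuming an ASCII environment; the function name control_escape is hypothetical and is not part of PCRE.

```python
def control_escape(x):
    """Return the character that \\cx denotes under the rule quoted above.

    Assumes x is a single ASCII character; EBCDIC handling is not modelled.
    """
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return chr(ord(x) ^ 0x40)


if __name__ == "__main__":
    for ch in ("z", "Z", "{", ";"):
        print(repr(ch), "->", hex(ord(control_escape(ch))))
```

Under this rule "z" and "Z" both give hex 1A, while "{" (hex 7B) gives ";" (hex 3B) and ";" gives "{".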
# Line 252  both byte mode and UTF-8 mode. (When PCR Line 253  both byte mode and UTF-8 mode. (When PCR
253  values are valid. A lower case letter is converted to upper case, and then the  values are valid. A lower case letter is converted to upper case, and then the
254  0xc0 bits are flipped.)  0xc0 bits are flipped.)
255  .P  .P
256  After \ex, from zero to two hexadecimal digits are read (letters can be in  By default, after \ex, from zero to two hexadecimal digits are read (letters
257  upper or lower case). Any number of hexadecimal digits may appear between \ex{  can be in upper or lower case). Any number of hexadecimal digits may appear
258  and }, but the value of the character code must be less than 256 in non-UTF-8  between \ex{ and }, but the value of the character code must be less than 256
259  mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in  in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
260  hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code  value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
261  point, which is 10FFFF.  Unicode code point, which is 10FFFF.
262  .P  .P
263  If characters other than hexadecimal digits appear between \ex{ and }, or if  If characters other than hexadecimal digits appear between \ex{ and }, or if
264  there is no terminating }, this form of escape is not recognized. Instead, the  there is no terminating }, this form of escape is not recognized. Instead, the
265  initial \ex will be interpreted as a basic hexadecimal escape, with no  initial \ex will be interpreted as a basic hexadecimal escape, with no
266  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
267  .P  .P
268    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
269    as just described only when it is followed by two hexadecimal digits.
270    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
271    code points greater than 256 is provided by \eu, which must be followed by
272    four hexadecimal digits; otherwise it matches a literal "u" character.
273    .P
274  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
275  syntaxes for \ex. There is no difference in the way they are handled. For  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
276  example, \exdc is exactly the same as \ex{dc}.  way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
277    \eu00dc in JavaScript mode).
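The \x and \u rules in the revised text can be summarised as a small decision procedure. The Python sketch below follows the wording of the new paragraphs; it is an illustration only, not PCRE's implementation. The names read_x_escape and read_js_u_escape are hypothetical, the value limits (256, or 2**31 in UTF-8 mode) are not enforced, and treating an empty \x{} as value zero is an assumption that the text does not cover.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def read_x_escape(text, javascript_compat=False):
    """Interpret an escape whose text after the backslash starts with 'x'.

    Returns (code_point, characters_consumed), counting the 'x' itself.
    """
    assert text.startswith("x")
    rest = text[1:]

    if javascript_compat:
        # JavaScript mode: exactly two hex digits, otherwise a literal "x".
        if len(rest) >= 2 and all(c in HEX_DIGITS for c in rest[:2]):
            return int(rest[:2], 16), 3
        return ord("x"), 1

    if rest.startswith("{"):
        close = rest.find("}")
        if close != -1:
            digits = rest[1:close]
            if all(c in HEX_DIGITS for c in digits):
                # Assumption: an empty \x{} is taken as value zero.
                value = int(digits, 16) if digits else 0
                return value, 1 + close + 1
        # Non-hex content or no closing brace: fall through to the basic
        # form, which then sees no digits and yields value zero.

    # Basic form: from zero to two hexadecimal digits are read.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX_DIGITS:
        n += 1
    return (int(rest[:n], 16) if n else 0), 1 + n


def read_js_u_escape(text):
    """JavaScript mode only: \\u needs four hex digits, else a literal "u"."""
    assert text.startswith("u")
    rest = text[1:5]
    if len(rest) == 4 and all(c in HEX_DIGITS for c in rest):
        return int(rest, 16), 5
    return ord("u"), 1
```

With these definitions, read_x_escape("xdc"), read_x_escape("x{dc}") and read_js_u_escape("u00dc") all yield code point hex DC, matching the equivalence stated in the paragraph above.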
278  .P  .P
279  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
280  digits, just those that are present are used. Thus the sequence \e0\ex\e07  digits, just those that are present are used. Thus the sequence \e0\ex\e07
# Line 320  Note that octal values of 100 or greater Line 328  Note that octal values of 100 or greater
328  zero, because no more than three octal digits are ever read.  zero, because no more than three octal digits are ever read.
329  .P  .P
330  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
331  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, \eb is
332  sequence \eb is interpreted as the backspace character (hex 08). The sequences  interpreted as the backspace character (hex 08).
333  \eB, \eN, \eR, and \eX are not special inside a character class. Like any other  .P
334  unrecognized escape sequences, they are treated as the literal characters "B",  \eN is not allowed in a character class. \eB, \eR, and \eX are not special
335  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is  inside a character class. Like other unrecognized escape sequences, they are
336  set. Outside a character class, these sequences have different meanings.  treated as the literal characters "B", "R", and "X" by default, but cause an
337    error if the PCRE_EXTRA option is set. Outside a character class, these
338    sequences have different meanings.
339    .
340    .
341    .SS "Unsupported escape sequences"
342    .rs
343    .sp
344    In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
345    handler and used to modify the case of following characters. By default, PCRE
346    does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
347    option is set, \eU matches a "U" character, and \eu can be used to define a
348    character by code point, as described in the previous section.
349  .  .
350  .  .
351  .SS "Absolute and relative back references"  .SS "Absolute and relative back references"
# Line 387  This is the same as Line 407  This is the same as
407  .\" </a>  .\" </a>
408  the "." metacharacter  the "." metacharacter
409  .\"  .\"
410  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
411    PCRE does not support this.
412  .P  .P
413  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
414  of characters into two disjoint sets. Any given character matches one, and only  of characters into two disjoint sets. Any given character matches one, and only
# Line 964  special meaning in a character class. Line 985  special meaning in a character class.
985  .P  .P
986  The escape sequence \eN behaves like a dot, except that it is not affected by  The escape sequence \eN behaves like a dot, except that it is not affected by
987  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
988  that signifies the end of a line.  that signifies the end of a line. Perl also uses \eN to match characters by
989    name; PCRE does not support this.
990  .  .
991  .  .
992  .SH "MATCHING A SINGLE BYTE"  .SH "MATCHING A SINGLE BYTE"
993  .rs  .rs
994  .sp  .sp
995  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
996  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
997  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
998  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the  in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
999  rest of the string may start with a malformed UTF-8 character. For this reason,  breaks up characters into individual bytes, matching one byte with \eC in UTF-8
1000  the \eC escape sequence is best avoided.  mode means that the rest of the string may start with a malformed UTF-8
1001    character. This has undefined results, because PCRE assumes that it is dealing
1002    with valid UTF-8 strings (and by default it checks this at the start of
1003    processing unless the PCRE_NO_UTF8_CHECK option is used).
1004  .P  .P
1005  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
1006  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
1007  .\" </a>  .\" </a>
1008  (described below),  (described below)
1009  .\"  .\"
1010  because in UTF-8 mode this would make it impossible to calculate the length of  in UTF-8 mode, because this would make it impossible to calculate the length of
1011  the lookbehind.  the lookbehind.
1012    .P
1013    In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
1014    way of using it that avoids the problem of malformed UTF-8 characters is to
1015    use a lookahead to check the length of the next character, as in this pattern
1016    (ignore white space and line breaks):
1017    .sp
1018      (?| (?=[\ex00-\ex7f])(\eC) |
1019          (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1020          (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1021          (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1022    .sp
1023    A group that starts with (?| resets the capturing parentheses numbers in each
1024    alternative (see
1025    .\" HTML <a href="#dupsubpatternnumber">
1026    .\" </a>
1027    "Duplicate Subpattern Numbers"
1028    .\"
1029    below). The assertions at the start of each branch check the next UTF-8
1030    character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1031    character's individual bytes are then captured by the appropriate number of
1032    groups.
1033  .  .
1034  .  .
1035  .\" HTML <a name="characterclass"></a>  .\" HTML <a name="characterclass"></a>
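The four lookahead ranges in the \C pattern added by this hunk correspond to the UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. The Python sketch below shows that mapping and checks it against Python's own encoder; the helper name utf8_length is hypothetical, and the sample code points are chosen only for illustration.

```python
def utf8_length(code_point):
    """Number of bytes used to encode code_point, using the same four
    ranges as the lookahead assertions in the pattern above."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point outside the ranges used in the pattern")


if __name__ == "__main__":
    # Compare with the length of the actual UTF-8 encoding.
    for cp in (0x41, 0xDC, 0x20AC, 0x1F600):
        encoded = chr(cp).encode("utf-8")
        print(hex(cp), utf8_length(cp), len(encoded))
```

Each branch of the pattern therefore needs exactly as many (\C) groups as utf8_length returns for the characters admitted by its assertion.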
# Line 1926  temporarily move the current position ba Line 1972  temporarily move the current position ba
1972  match. If there are insufficient characters before the current position, the  match. If there are insufficient characters before the current position, the
1973  assertion fails.  assertion fails.
1974  .P  .P
1975  PCRE does not allow the \eC escape (which matches a single byte in UTF-8 mode)  In UTF-8 mode, PCRE does not allow the \eC escape (which matches a single byte,
1976  to appear in lookbehind assertions, because it makes it impossible to calculate  even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
1977  the length of the lookbehind. The \eX and \eR escapes, which can match  impossible to calculate the length of the lookbehind. The \eX and \eR escapes,
1978  different numbers of bytes, are also not permitted.  which can match different numbers of bytes, are also not permitted.
1979  .P  .P
1980  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
1981  .\" </a>  .\" </a>
# Line 2297  documentation). If the pattern above is Line 2343  documentation). If the pattern above is
2343  .sp  .sp
2344  the value for the inner capturing parentheses (numbered 2) is "ef", which is  the value for the inner capturing parentheses (numbered 2) is "ef", which is
2345  the last value taken on at the top level. If a capturing subpattern is not  the last value taken on at the top level. If a capturing subpattern is not
2346  matched at the top level, its final value is unset, even if it is (temporarily)  matched at the top level, its final captured value is unset, even if it was
2347  set at a deeper level.  (temporarily) set at a deeper level during the matching process.
2348  .P  .P
2349  If there are more than 15 capturing parentheses in a pattern, PCRE has to  If there are more than 15 capturing parentheses in a pattern, PCRE has to
2350  obtain extra memory to store data during a recursion, which it does by using  obtain extra memory to store data during a recursion, which it does by using
# Line 2318  is the actual recursive call. Line 2364  is the actual recursive call.
2364  .  .
2365  .  .
2366  .\" HTML <a name="recursiondifference"></a>  .\" HTML <a name="recursiondifference"></a>
2367  .SS "Recursion difference from Perl"  .SS "Differences in recursion processing between PCRE and Perl"
2368  .rs  .rs
2369  .sp  .sp
2370  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  Recursion processing in PCRE differs from Perl in two important ways. In PCRE
2371  treated as an atomic group. That is, once it has matched some of the subject  (like Python, but unlike Perl), a recursive subpattern call is always treated
2372  string, it is never re-entered, even if it contains untried alternatives and  as an atomic group. That is, once it has matched some of the subject string, it
2373  there is a subsequent matching failure. This can be illustrated by the  is never re-entered, even if it contains untried alternatives and there is a
2374  following pattern, which purports to match a palindromic string that contains  subsequent matching failure. This can be illustrated by the following pattern,
2375  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  which purports to match a palindromic string that contains an odd number of
2376    characters (for example, "a", "aba", "abcba", "abcdcba"):
2377  .sp  .sp
2378    ^(.|(.)(?1)\e2)$    ^(.|(.)(?1)\e2)$
2379  .sp  .sp
# Line 2387  For example, although "abcba" is correct Line 2434  For example, although "abcba" is correct
2434  PCRE finds the palindrome "aba" at the start, then fails at top level because  PCRE finds the palindrome "aba" at the start, then fails at top level because
2435  the end of the string does not follow. Once again, it cannot jump back into the  the end of the string does not follow. Once again, it cannot jump back into the
2436  recursion to try other alternatives, so the entire match fails.  recursion to try other alternatives, so the entire match fails.
2437    .P
2438    The second way in which PCRE and Perl differ in their recursion processing is
2439    in the handling of captured values. In Perl, when a subpattern is called
2440    recursively or as a subpattern (see the next section), it has no access to any
2441    values that were captured outside the recursion, whereas in PCRE these values
2442    can be referenced. Consider this pattern:
2443    .sp
2444      ^(.)(\e1|a(?2))
2445    .sp
2446    In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2447    then in the second group, when the back reference \e1 fails to match "b", the
2448    second alternative matches "a" and then recurses. In the recursion, \e1 does
2449    now match "b" and so the whole match succeeds. In Perl, the pattern fails to
2450    match because inside the recursive call \e1 cannot access the externally set
2451    value.
2452  .  .
2453  .  .
2454  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
# Line 2746  pattern fragments that do not contain an Line 2808  pattern fragments that do not contain an
2808  .sp  .sp
2809    A (B(*THEN)C) | D    A (B(*THEN)C) | D
2810  .sp  .sp
2811  If A and B are matched, but there is a failure in C, matching does not  If A and B are matched, but there is a failure in C, matching does not
2812  backtrack into A; instead it moves to the next alternative, that is, D.  backtrack into A; instead it moves to the next alternative, that is, D.
2813  However, if the subpattern containing (*THEN) is given an alternative, it  However, if the subpattern containing (*THEN) is given an alternative, it
2814  behaves differently:  behaves differently:
# Line 2754  behaves differently: Line 2816  behaves differently:
2816    A (B(*THEN)C | (*FAIL)) | D    A (B(*THEN)C | (*FAIL)) | D
2817  .sp  .sp
2818  The effect of (*THEN) is now confined to the inner subpattern. After a failure  The effect of (*THEN) is now confined to the inner subpattern. After a failure
2819  in C, matching moves to (*FAIL), which causes the whole subpattern to fail  in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2820  because there are no more alternatives to try. In this case, matching does now  because there are no more alternatives to try. In this case, matching does now
2821  backtrack into A.  backtrack into A.
2822  .P  .P
2823  Note also that a conditional subpattern is not considered as having two  Note also that a conditional subpattern is not considered as having two
2824  alternatives, because only one is ever used. In other words, the | character in  alternatives, because only one is ever used. In other words, the | character in
2825  a conditional subpattern has a different meaning. Ignoring white space,  a conditional subpattern has a different meaning. Ignoring white space,
2826  consider:  consider:
2827  .sp  .sp
2828    ^.*? (?(?=a) a | b(*THEN)c )    ^.*? (?(?=a) a | b(*THEN)c )
2829  .sp  .sp
2830  If the subject is "ba", this pattern does not match. Because .*? is ungreedy,  If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2831  it initially matches zero characters. The condition (?=a) then fails, the  it initially matches zero characters. The condition (?=a) then fails, the
2832  character "b" is matched, but "c" is not. At this point, matching does not  character "b" is matched, but "c" is not. At this point, matching does not
2833  backtrack to .*? as might perhaps be expected from the presence of the |  backtrack to .*? as might perhaps be expected from the presence of the |
2834  character. The conditional subpattern is part of the single alternative that  character. The conditional subpattern is part of the single alternative that
2835  comprises the whole pattern, and so the match fails. (If there was a backtrack  comprises the whole pattern, and so the match fails. (If there was a backtrack
2836  into .*?, allowing it to match "b", the match would succeed.)  into .*?, allowing it to match "b", the match would succeed.)
2837  .P  .P
2838  The verbs just described provide four different "strengths" of control when  The verbs just described provide four different "strengths" of control when
# Line 2814  Cambridge CB2 3QH, England. Line 2876  Cambridge CB2 3QH, England.
2876  .rs  .rs
2877  .sp  .sp
2878  .nf  .nf
2879  Last updated: 04 October 2011  Last updated: 19 November 2011
2880  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2881  .fi  .fi

