Diff of /code/trunk/doc/pcrepattern.3

revision 1370 by ph10, Wed Oct 9 10:18:26 2013 UTC revision 1391 by ph10, Wed Nov 6 16:43:07 2013 UTC
# Line 1  Line 1 
1  .TH PCREPATTERN 3 "08 October 2013" "PCRE 8.34"  .TH PCREPATTERN 3 "05 November 2013" "PCRE 8.34"
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PCRE REGULAR EXPRESSION DETAILS"  .SH "PCRE REGULAR EXPRESSION DETAILS"
# Line 164  pattern of the form Line 164  pattern of the form
164    (*LIMIT_RECURSION=d)    (*LIMIT_RECURSION=d)
165  .sp  .sp
166  where d is any number of decimal digits. However, the value of the setting must  where d is any number of decimal digits. However, the value of the setting must
167  be less than the value set by the caller of \fBpcre_exec()\fP for it to have  be less than the value set (or defaulted) by the caller of \fBpcre_exec()\fP
168  any effect. In other words, the pattern writer can lower the limit set by the  for it to have any effect. In other words, the pattern writer can lower the
169  programmer, but not raise it. If there is more than one setting of one of these  limits set by the programmer, but not raise them. If there is more than one
170  limits, the lower value is used.  setting of one of these limits, the lower value is used.
171  .  .
172  .  .
173  .SH "EBCDIC CHARACTER CODES"  .SH "EBCDIC CHARACTER CODES"
# Line 414  limited to certain values, as follows: Line 414  limited to certain values, as follows:
414    8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint    8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
415    16-bit non-UTF mode   less than 0x10000    16-bit non-UTF mode   less than 0x10000
416    16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint    16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
417    32-bit non-UTF mode   less than 0x80000000    32-bit non-UTF mode   less than 0x100000000
418    32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint    32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
419  .sp  .sp
420  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
# Line 543  efficiency reasons. However, if PCRE is Line 543  efficiency reasons. However, if PCRE is
543  and the PCRE_UCP option is set, the behaviour is changed so that Unicode  and the PCRE_UCP option is set, the behaviour is changed so that Unicode
544  properties are used to determine character types, as follows:  properties are used to determine character types, as follows:
545  .sp  .sp
546    \ed  any character that \ep{Nd} matches (decimal digit)    \ed  any character that matches \ep{Nd} (decimal digit)
547    \es  any character that \ep{Z} matches, plus HT, LF, FF, CR    \es  any character that matches \ep{Z} or \eh or \ev
548    \ew  any character that \ep{L} or \ep{N} matches, plus underscore    \ew  any character that matches \ep{L} or \ep{N}, plus underscore
549  .sp  .sp
550  The upper case escapes match the inverse sets of characters. Note that \ed  The upper case escapes match the inverse sets of characters. Note that \ed
551  matches only decimal digits, whereas \ew matches any Unicode digit, as well as  matches only decimal digits, whereas \ew matches any Unicode digit, as well as
# Line 925  the "mark" property always have the "ext Line 925  the "mark" property always have the "ext
925  .sp  .sp
926  As well as the standard Unicode properties described above, PCRE supports four  As well as the standard Unicode properties described above, PCRE supports four
927  more that make it possible to convert traditional escape sequences such as \ew  more that make it possible to convert traditional escape sequences such as \ew
928  and \es and POSIX character classes to use Unicode properties. PCRE uses these  and \es to use Unicode properties. PCRE uses these non-standard, non-Perl
929  non-standard, non-Perl properties internally when PCRE_UCP is set. However,  properties internally when PCRE_UCP is set. However, they may also be used
930  they may also be used explicitly. These properties are:  explicitly. These properties are:
931  .sp  .sp
932    Xan   Any alphanumeric character    Xan   Any alphanumeric character
933    Xps   Any POSIX space character    Xps   Any POSIX space character
# Line 937  they may also be used explicitly. These Line 937  they may also be used explicitly. These
937  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
938  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
939  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
940  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps; it used to exclude vertical tab, for Perl
941  same characters as Xan, plus underscore.  compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
942    matches the same characters as Xan, plus underscore.
943  .P  .P
944  There is another non-standard property, Xuc, which matches any character that  There is another non-standard property, Xuc, which matches any character that
945  can be represented by a Universal Character Name in C++ and other programming  can be represented by a Universal Character Name in C++ and other programming
# Line 1309  are: Line 1310  are:
1310    lower    lower case letters    lower    lower case letters
1311    print    printing characters, including space    print    printing characters, including space
1312    punct    printing characters, excluding letters and digits and space    punct    printing characters, excluding letters and digits and space
1313    space    white space (not quite the same as \es)    space    white space (the same as \es from PCRE 8.34)
1314    upper    upper case letters    upper    upper case letters
1315    word     "word" characters (same as \ew)    word     "word" characters (same as \ew)
1316    xdigit   hexadecimal digits    xdigit   hexadecimal digits
# Line 1332  supported, and an error is given if they Line 1333  supported, and an error is given if they
1333  By default, in UTF modes, characters with values greater than 128 do not match  By default, in UTF modes, characters with values greater than 128 do not match
1334  any of the POSIX character classes. However, if the PCRE_UCP option is passed  any of the POSIX character classes. However, if the PCRE_UCP option is passed
1335  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
1336  character properties are used. This is achieved by replacing the POSIX classes  character properties are used. This is achieved by replacing certain POSIX
1337  by other sequences, as follows:  classes by other sequences, as follows:
1338  .sp  .sp
1339    [:alnum:]  becomes  \ep{Xan}    [:alnum:]  becomes  \ep{Xan}
1340    [:alpha:]  becomes  \ep{L}    [:alpha:]  becomes  \ep{L}
# Line 1344  by other sequences, as follows: Line 1345  by other sequences, as follows:
1345    [:upper:]  becomes  \ep{Lu}    [:upper:]  becomes  \ep{Lu}
1346    [:word:]   becomes  \ep{Xwd}    [:word:]   becomes  \ep{Xwd}
1347  .sp  .sp
1348  Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX  Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX
1349  classes are unchanged, and match only characters with code points less than  classes are handled specially in UCP mode:
1350  128.  .TP 10
1351    [:graph:]
1352    This matches characters that have glyphs that mark the page when printed. In
1353    Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1354    properties, except for:
1355    .sp
1356      U+061C           Arabic Letter Mark
1357      U+180E           Mongolian Vowel Separator
1358      U+2066 - U+2069  Various "isolate"s
1359    .sp
1360    .TP 10
1361    [:print:]
1362    This matches the same characters as [:graph:] plus space characters that are
1363    not controls, that is, characters with the Zs property.
1364    .TP 10
1365    [:punct:]
1366    This matches all characters that have the Unicode P (punctuation) property,
1367    plus those characters whose code points are less than 128 that have the S
1368    (Symbol) property.
1369    .P
1370    The other POSIX classes are unchanged, and match only characters with code
1371    points less than 128.
1372  .  .
1373  .  .
1374  .SH "VERTICAL BAR"  .SH "VERTICAL BAR"
# Line 3176  Cambridge CB2 3QH, England. Line 3198  Cambridge CB2 3QH, England.
3198  .rs  .rs
3199  .sp  .sp
3200  .nf  .nf
3201  Last updated: 08 October 2013  Last updated: 05 November 2013
3202  Copyright (c) 1997-2013 University of Cambridge.  Copyright (c) 1997-2013 University of Cambridge.
3203  .fi  .fi
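
The UCP-mode rules for [:graph:], [:print:] and [:punct:] added in v.1391 can be sketched in a few lines of Python. This is only an illustration of the wording in the new text, not PCRE's implementation: the function names are hypothetical, and it relies on Python's unicodedata tables, whose Unicode version may differ from the one PCRE 8.34 was built with.

```python
import unicodedata

# Code points that the new text excludes from [:graph:] in UCP mode:
# U+061C, U+180E, and the "isolate" controls U+2066 - U+2069.
GRAPH_EXCLUDED = {0x061C, 0x180E, 0x2066, 0x2067, 0x2068, 0x2069}


def ucp_graph(ch: str) -> bool:
    """Hypothetical model of [:graph:]: L, M, N, P, S or Cf, minus exclusions."""
    if ord(ch) in GRAPH_EXCLUDED:
        return False
    cat = unicodedata.category(ch)
    return cat[0] in "LMNPS" or cat == "Cf"


def ucp_print(ch: str) -> bool:
    """Hypothetical model of [:print:]: [:graph:] plus Zs (space separators)."""
    return ucp_graph(ch) or unicodedata.category(ch) == "Zs"


def ucp_punct(ch: str) -> bool:
    """Hypothetical model of [:punct:]: any P, plus S only below code point 128."""
    cat = unicodedata.category(ch)
    return cat[0] == "P" or (ord(ch) < 128 and cat[0] == "S")
```

Under this model "+" and "$" count as punctuation because they are ASCII characters with the S property, whereas a non-ASCII symbol such as the euro sign does not.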