Diff of /code/trunk/doc/pcrepattern.3

revision 1401 by ph10, Tue Nov 12 17:44:07 2013 UTC revision 1412 by ph10, Sun Dec 15 17:01:46 2013 UTC
# Line 1  Line 1 
1  .TH PCREPATTERN 3 "12 November 2013" "PCRE 8.34"  .TH PCREPATTERN 3 "03 December 2013" "PCRE 8.34"
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PCRE REGULAR EXPRESSION DETAILS"  .SH "PCRE REGULAR EXPRESSION DETAILS"
# Line 90  table. Line 90  table.
90  .SS "Disabling auto-possessification"  .SS "Disabling auto-possessification"
91  .rs  .rs
92  .sp  .sp
93  If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting  If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
94  the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making  the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making
95  quantifiers possessive when what follows cannot match the repeated item. For  quantifiers possessive when what follows cannot match the repeated item. For
96  example, by default a+b is treated as a++b. For more details, see the  example, by default a+b is treated as a++b. For more details, see the
# Line 317  one of the following escape sequences th Line 317  one of the following escape sequences th
317    \en        linefeed (hex 0A)    \en        linefeed (hex 0A)
318    \er        carriage return (hex 0D)    \er        carriage return (hex 0D)
319    \et        tab (hex 09)    \et        tab (hex 09)
320    \e0dd      character with octal code 0dd    \e0dd      character with octal code 0dd
321    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
322    \eo{ddd..} character with octal code ddd..    \eo{ddd..} character with octal code ddd..
323    \exhh      character with hex code hh    \exhh      character with hex code hh
324    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
325    \euhhhh    character with hex code hhhh (JavaScript mode only)    \euhhhh    character with hex code hhhh (JavaScript mode only)
# Line 346  specifies two binary zeros followed by a Line 346  specifies two binary zeros followed by a
346  sure you supply two digits after the initial zero if the pattern character that  sure you supply two digits after the initial zero if the pattern character that
347  follows is itself an octal digit.  follows is itself an octal digit.
348  .P  .P
349  The escape \eo must be followed by a sequence of octal digits, enclosed in  The escape \eo must be followed by a sequence of octal digits, enclosed in
350  braces. An error occurs if this is not the case. This escape is a recent  braces. An error occurs if this is not the case. This escape is a recent
351  addition to Perl; it provides a way of specifying character code points as octal  addition to Perl; it provides a way of specifying character code points as octal
352  numbers greater than 0777, and it also allows octal numbers and back references  numbers greater than 0777, and it also allows octal numbers and back references
# Line 435  limited to certain values, as follows: Line 435  limited to certain values, as follows:
435    32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint    32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
436  .sp  .sp
437  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
438  "surrogate" codepoints), and 0xffef.  "surrogate" codepoints), and 0xffef.
439  .  .
440  .  .
441  .SS "Escape sequences in character classes"  .SS "Escape sequences in character classes"
# Line 535  For compatibility with Perl, \es did not Line 535  For compatibility with Perl, \es did not
535  11), which made it different from the POSIX "space" class. However, Perl  11), which made it different from the POSIX "space" class. However, Perl
536  added VT at release 5.18, and PCRE followed suit at release 8.34. The default  added VT at release 5.18, and PCRE followed suit at release 8.34. The default
537  \es characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space  \es characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
538  (32), which are defined as white space in the "C" locale. This list may vary if  (32), which are defined as white space in the "C" locale. This list may vary if
539  locale-specific matching is taking place; in particular, in some locales the  locale-specific matching is taking place. For example, in some locales the
540  "non-breaking space" character (\exA0) is recognized as white space.  "non-breaking space" character (\exA0) is recognized as white space, and in
541    others the VT character is not.
542  .P  .P
543  A "word" character is an underscore or any character that is a letter or digit.  A "word" character is an underscore or any character that is a letter or digit.
544  By default, the definition of letters and digits is controlled by PCRE's  By default, the definition of letters and digits is controlled by PCRE's
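
The default \es list in the hunk above (HT, LF, VT, FF, CR, space) can be checked against a regex engine. The following is a minimal sketch that uses Python's re module as a stand-in, not PCRE itself; it assumes that Python's ASCII-mode \s corresponds to the "C" locale set described here, and the variable names are illustrative only.

```python
import re

# The six default white space code points listed in the man page text.
DEFAULT_SPACE = [9, 10, 11, 12, 13, 32]

# re.ASCII restricts \s to ASCII white space, comparable to the "C" locale.
ascii_space = re.compile(r"\s", re.ASCII)

# Collect every code point below 256 that \s accepts in ASCII mode.
matched_codes = [c for c in range(256) if ascii_space.fullmatch(chr(c))]

print(matched_codes == DEFAULT_SPACE)

# Without re.ASCII the set is wider: the non-breaking space (hex A0) matches,
# loosely analogous to the locale-dependent variation the text mentions.
nbsp_is_space = re.fullmatch(r"\s", "\xa0") is not None
print(nbsp_is_space)
```

The same check run against PCRE would need the pcre library and, for the locale case, character tables built with pcre_maketables(); that setup is outside the scope of this sketch.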
# Line 1257  The minus (hyphen) character can be used Line 1258  The minus (hyphen) character can be used
1258  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
1259  inclusive. If a minus character is required in a class, it must be escaped with  inclusive. If a minus character is required in a class, it must be escaped with
1260  a backslash or appear in a position where it cannot be interpreted as  a backslash or appear in a position where it cannot be interpreted as
1261  indicating a range, typically as the first or last character in the class, or  indicating a range, typically as the first or last character in the class, or
1262  immediately after a range. For example, [b-d-z] matches letters in the range b  immediately after a range. For example, [b-d-z] matches letters in the range b
1263  to d, a hyphen character, or z.  to d, a hyphen character, or z.
1264  .P  .P
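
The [b-d-z] example in the hunk above can be tried directly. This is a small sketch using Python's re module, on the assumption that it parses a hyphen immediately after a range the same way PCRE does; the helper name class_members is hypothetical.

```python
import re

def class_members(char_class, candidates):
    """Return the candidate characters that the given character class matches."""
    pattern = re.compile(char_class)
    return "".join(c for c in candidates if pattern.fullmatch(c))

# The hyphen after the range b-d cannot start a new range, so it is literal.
print(class_members(r"[b-d-z]", "abcdez-y"))
```

Here the class accepts b, c, d, z, and the hyphen, and rejects a, e, and y, which is the behaviour the text describes.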
# Line 1312  something AND NOT ...". Line 1313  something AND NOT ...".
1313  The only metacharacters that are recognized in character classes are backslash,  The only metacharacters that are recognized in character classes are backslash,
1314  hyphen (only where it can be interpreted as specifying a range), circumflex  hyphen (only where it can be interpreted as specifying a range), circumflex
1315  (only at the start), opening square bracket (only when it can be interpreted as  (only at the start), opening square bracket (only when it can be interpreted as
1316  introducing a POSIX class name - see the next section), and the terminating  introducing a POSIX class name, or for a special compatibility feature - see
1317  closing square bracket. However, escaping other non-alphanumeric characters  the next two sections), and the terminating closing square bracket. However,
1318  does no harm.  escaping other non-alphanumeric characters does no harm.
1319  .  .
1320  .  .
1321  .SH "POSIX CHARACTER CLASSES"  .SH "POSIX CHARACTER CLASSES"
# Line 1345  are: Line 1346  are:
1346    xdigit   hexadecimal digits    xdigit   hexadecimal digits
1347  .sp  .sp
1348  The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),  The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1349  and space (32). If locale-specific matching is taking place, there may be  and space (32). If locale-specific matching is taking place, the list of space
1350  additional space characters. "Space" used to be different to \es, which did not  characters may be different; there may be fewer or more of them. "Space" used
1351  include VT, for Perl compatibility. However, Perl changed at release 5.18, and  to be different to \es, which did not include VT, for Perl compatibility.
1352  PCRE followed at release 8.34. "Space" and \es now match the same set of  However, Perl changed at release 5.18, and PCRE followed at release 8.34.
1353  characters.  "Space" and \es now match the same set of characters.
1354  .P  .P
1355  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1356  5.8. Another Perl extension is negation, which is indicated by a ^ character  5.8. Another Perl extension is negation, which is indicated by a ^ character
# Line 1376  other sequences, as follows: Line 1377  other sequences, as follows:
1377    [:upper:]  becomes  \ep{Lu}    [:upper:]  becomes  \ep{Lu}
1378    [:word:]   becomes  \ep{Xwd}    [:word:]   becomes  \ep{Xwd}
1379  .sp  .sp
1380  Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX  Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX
1381  classes are handled specially in UCP mode:  classes are handled specially in UCP mode:
1382  .TP 10  .TP 10
1383  [:graph:]  [:graph:]
1384  This matches characters that have glyphs that mark the page when printed. In  This matches characters that have glyphs that mark the page when printed. In
1385  Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf  Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1386  properties, except for:  properties, except for:
1387  .sp  .sp
1388    U+061C           Arabic Letter Mark    U+061C           Arabic Letter Mark
1389    U+180E           Mongolian Vowel Separator    U+180E           Mongolian Vowel Separator
1390    U+2066 - U+2069  Various "isolate"s    U+2066 - U+2069  Various "isolate"s
1391  .sp  .sp
1392  .TP 10  .TP 10
1393  [:print:]  [:print:]
1394  This matches the same characters as [:graph:] plus space characters that are  This matches the same characters as [:graph:] plus space characters that are
1395  not controls, that is, characters with the Zs property.  not controls, that is, characters with the Zs property.
1396  .TP 10  .TP 10
1397  [:punct:]  [:punct:]
# Line 1402  The other POSIX classes are unchanged, a Line 1403  The other POSIX classes are unchanged, a
1403  points less than 128.  points less than 128.
1404  .  .
1405  .  .
1406    .SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
1407    .rs
1408    .sp
1409    In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
1410    syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
1411    word". PCRE treats these items as follows:
1412    .sp
1413      [[:<:]]  is converted to  \eb(?=\ew)
1414      [[:>:]]  is converted to  \eb(?<=\ew)
1415    .sp
1416    Only these exact character sequences are recognized. A sequence such as
1417    [a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
1418    not compatible with Perl. It is provided to help migrations from other
1419    environments, and is best not used in any new patterns. Note that \eb matches
1420    at the start and the end of a word (see
1421    .\" HTML <a href="#smallassertions">
1422    .\" </a>
1423    "Simple assertions"
1424    .\"
1425    above), and in a Perl-style pattern the preceding or following character
1426    normally shows which is wanted, without the need for the assertions that are
1427    used above in order to give exactly the POSIX behaviour.
1428    .
1429    .
1430  .SH "VERTICAL BAR"  .SH "VERTICAL BAR"
1431  .rs  .rs
1432  .sp  .sp
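
The two conversions given in the added "COMPATIBILITY FEATURE FOR WORD BOUNDARIES" section can be exercised in any engine that supports \eb and lookaround. The sketch below uses Python's re module as a stand-in for PCRE and applies the converted forms directly, since Python does not recognize [[:<:]] or [[:>:]]; the constant and function names are illustrative assumptions.

```python
import re

# Converted forms taken from the added section of the man page.
START_OF_WORD = r"\b(?=\w)"   # what [[:<:]] is converted to
END_OF_WORD = r"\b(?<=\w)"    # what [[:>:]] is converted to

def word_starts(text):
    """Offsets where a word boundary is followed by a word character."""
    return [m.start() for m in re.finditer(START_OF_WORD, text)]

def word_ends(text):
    """Offsets where a word boundary is preceded by a word character."""
    return [m.start() for m in re.finditer(END_OF_WORD, text)]

sample = "foo bar"
print(word_starts(sample))  # [0, 4]
print(word_ends(sample))    # [3, 7]
```

A bare \eb would report all four offsets (0, 3, 4, 7); the lookahead and lookbehind are what separate starts from ends, which is the point the section makes about reproducing the POSIX behaviour exactly.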
# Line 1619  conditions, Line 1644  conditions,
1644  .\"  .\"
1645  can be made by name as well as by number.  can be made by name as well as by number.
1646  .P  .P
1647  Names consist of up to 32 alphanumeric characters and underscores, but must  Names consist of up to 32 alphanumeric characters and underscores, but must
1648  start with a non-digit. Named capturing parentheses are still allocated numbers  start with a non-digit. Named capturing parentheses are still allocated numbers
1649  as well as names, exactly as if the names were not present. The PCRE API  as well as names, exactly as if the names were not present. The PCRE API
1650  provides function calls for extracting the name-to-number translation table  provides function calls for extracting the name-to-number translation table
# Line 1650  for the first (and in this example, the Line 1675  for the first (and in this example, the
1675  matched. This saves searching to find which numbered subpattern it was.  matched. This saves searching to find which numbered subpattern it was.
1676  .P  .P
1677  If you make a back reference to a non-unique named subpattern from elsewhere in  If you make a back reference to a non-unique named subpattern from elsewhere in
1678  the pattern, the subpatterns to which the name refers are checked in the order  the pattern, the subpatterns to which the name refers are checked in the order
1679  in which they appear in the overall pattern. The first one that is set is used  in which they appear in the overall pattern. The first one that is set is used
1680  for the reference. For example, this pattern matches both "foofoo" and  for the reference. For example, this pattern matches both "foofoo" and
1681  "barbar" but not "foobar" or "barfoo":  "barbar" but not "foobar" or "barfoo":
1682  .sp  .sp
1683    (?:(?<n>foo)|(?<n>bar))\k<n>    (?:(?<n>foo)|(?<n>bar))\ek<n>
1684  .sp  .sp
1685  .P  .P
1686  If you make a subroutine call to a non-unique named subpattern, the one that  If you make a subroutine call to a non-unique named subpattern, the one that
# Line 2356  This makes the fragment independent of t Line 2381  This makes the fragment independent of t
2381  .sp  .sp
2382  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2383  subpattern by name. For compatibility with earlier versions of PCRE, which had  subpattern by name. For compatibility with earlier versions of PCRE, which had
2384  this facility before Perl, the syntax (?(name)...) is also recognized.  this facility before Perl, the syntax (?(name)...) is also recognized.
2385  .P  .P
2386  Rewriting the above example to use a named subpattern gives this:  Rewriting the above example to use a named subpattern gives this:
2387  .sp  .sp
# Line 3230  Cambridge CB2 3QH, England. Line 3255  Cambridge CB2 3QH, England.
3255  .rs  .rs
3256  .sp  .sp
3257  .nf  .nf
3258  Last updated: 12 November 2013  Last updated: 03 December 2013
3259  Copyright (c) 1997-2013 University of Cambridge.  Copyright (c) 1997-2013 University of Cambridge.
3260  .fi  .fi
