
Diff of /code/trunk/doc/pcrepattern.3


revision 1391 by ph10, Wed Nov 6 16:43:07 2013 UTC revision 1404 by ph10, Tue Nov 19 15:36:57 2013 UTC
# Line 1  Line 1 
1  .TH PCREPATTERN 3 "05 November 2013" "PCRE 8.34"  .TH PCREPATTERN 3 "12 November 2013" "PCRE 8.34"
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PCRE REGULAR EXPRESSION DETAILS"  .SH "PCRE REGULAR EXPRESSION DETAILS"
# Line 80  appearance causes an error. Line 80  appearance causes an error.
80  .SS "Unicode property support"  .SS "Unicode property support"
81  .rs  .rs
82  .sp  .sp
83  Another special sequence that may appear at the start of a pattern is  Another special sequence that may appear at the start of a pattern is (*UCP).
 .sp  
   (*UCP)  
 .sp  
84  This has the same effect as setting the PCRE_UCP option: it causes sequences  This has the same effect as setting the PCRE_UCP option: it causes sequences
85  such as \ed and \ew to use Unicode properties to determine character types,  such as \ed and \ew to use Unicode properties to determine character types,
86  instead of recognizing only characters with codes less than 128 via a lookup  instead of recognizing only characters with codes less than 128 via a lookup
87  table.  table.
88  .  .
89  .  .
90    .SS "Disabling auto-possessification"
91    .rs
92    .sp
93    If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
94    the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making
95    quantifiers possessive when what follows cannot match the repeated item. For
96    example, by default a+b is treated as a++b. For more details, see the
97    .\" HREF
98    \fBpcreapi\fP
99    .\"
100    documentation.
101    .
102    .
103  .SS "Disabling start-up optimizations"  .SS "Disabling start-up optimizations"
104  .rs  .rs
105  .sp  .sp
106  If a pattern starts with (*NO_START_OPT), it has the same effect as setting the  If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
107  PCRE_NO_START_OPTIMIZE option either at compile or matching time.  PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables
108    several optimizations for quickly reaching "no match" results. For more
109    details, see the
110    .\" HREF
111    \fBpcreapi\fP
112    .\"
113    documentation.
114  .  .
115  .  .
116  .\" HTML <a name="newlines"></a>  .\" HTML <a name="newlines"></a>
# Line 257  In a UTF mode, only ASCII numbers and le Line 273  In a UTF mode, only ASCII numbers and le
273  backslash. All other characters (in particular, those whose codepoints are  backslash. All other characters (in particular, those whose codepoints are
274  greater than 127) are treated as literals.  greater than 127) are treated as literals.
275  .P  .P
276  If a pattern is compiled with the PCRE_EXTENDED option, white space in the  If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
277  pattern (other than in a character class) and characters between a # outside  pattern (other than in a character class), and characters between a # outside a
278  a character class and the next newline are ignored. An escaping backslash can  character class and the next newline, inclusive, are ignored. An escaping
279  be used to include a white space or # character as part of the pattern.  backslash can be used to include a white space or # character as part of the
280    pattern.
281  .P  .P
282  If you want to remove the special meaning from a sequence of characters, you  If you want to remove the special meaning from a sequence of characters, you
283  can do so by putting them between \eQ and \eE. This is different from Perl in  can do so by putting them between \eQ and \eE. This is different from Perl in
# Line 300  one of the following escape sequences th Line 317  one of the following escape sequences th
317    \en        linefeed (hex 0A)    \en        linefeed (hex 0A)
318    \er        carriage return (hex 0D)    \er        carriage return (hex 0D)
319    \et        tab (hex 09)    \et        tab (hex 09)
320    \e0dd      character with octal code 0dd    \e0dd      character with octal code 0dd
321    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
322    \eo{ddd..} character with octal code ddd..    \eo{ddd..} character with octal code ddd..
323    \exhh      character with hex code hh    \exhh      character with hex code hh
324    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
325    \euhhhh    character with hex code hhhh (JavaScript mode only)    \euhhhh    character with hex code hhhh (JavaScript mode only)
# Line 329  specifies two binary zeros followed by a Line 346  specifies two binary zeros followed by a
346  sure you supply two digits after the initial zero if the pattern character that  sure you supply two digits after the initial zero if the pattern character that
347  follows is itself an octal digit.  follows is itself an octal digit.
348  .P  .P
349  The escape \eo must be followed by a sequence of octal digits, enclosed in  The escape \eo must be followed by a sequence of octal digits, enclosed in
350  braces. An error occurs if this is not the case. This escape is a recent  braces. An error occurs if this is not the case. This escape is a recent
351  addition to Perl; it provides a way of specifying character code points as octal  addition to Perl; it provides a way of specifying character code points as octal
352  numbers greater than 0777, and it also allows octal numbers and back references  numbers greater than 0777, and it also allows octal numbers and back references
# Line 418  limited to certain values, as follows: Line 435  limited to certain values, as follows:
435    32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint    32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
436  .sp  .sp
437  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
438  "surrogate" codepoints), and 0xffef.  "surrogate" codepoints), and 0xffef.
439  .  .
440  .  .
441  .SS "Escape sequences in character classes"  .SS "Escape sequences in character classes"
# Line 516  there is no character to match. Line 533  there is no character to match.
533  .P  .P
534  For compatibility with Perl, \es formerly did not match the VT character (code  For compatibility with Perl, \es formerly did not match the VT character (code
535  11), which made it different from the POSIX "space" class. However, Perl  11), which made it different from the POSIX "space" class. However, Perl
536  added VT at release 5.18, and PCRE followed suit at release 8.34. The \es  added VT at release 5.18, and PCRE followed suit at release 8.34. The default
537  characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).  \es characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
538    (32), which are defined as white space in the "C" locale. This list may vary if
539    locale-specific matching is taking place; in particular, in some locales the
540    "non-breaking space" character (\exA0) is recognized as white space.
541  .P  .P
542  A "word" character is an underscore or any character that is a letter or digit.  A "word" character is an underscore or any character that is a letter or digit.
543  By default, the definition of letters and digits is controlled by PCRE's  By default, the definition of letters and digits is controlled by PCRE's
# Line 532  in the Line 552  in the
552  \fBpcreapi\fP  \fBpcreapi\fP
553  .\"  .\"
554  page). For example, in a French locale such as "fr_FR" in Unix-like systems,  page). For example, in a French locale such as "fr_FR" in Unix-like systems,
555  or "french" in Windows, some character codes greater than 128 are used for  or "french" in Windows, some character codes greater than 127 are used for
556  accented letters, and these are then matched by \ew. The use of locales with  accented letters, and these are then matched by \ew. The use of locales with
557  Unicode is discouraged.  Unicode is discouraged.
558  .P  .P
559  By default, in a UTF mode, characters with values greater than 128 never match  By default, characters whose code points are greater than 127 never match \ed,
560  \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain  \es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
561  their original meanings from before UTF support was available, mainly for  characters in the range 128-255 when locale-specific matching is happening.
562  efficiency reasons. However, if PCRE is compiled with Unicode property support,  These escape sequences retain their original meanings from before Unicode
563  and the PCRE_UCP option is set, the behaviour is changed so that Unicode  support was available, mainly for efficiency reasons. If PCRE is compiled with
564  properties are used to determine character types, as follows:  Unicode property support, and the PCRE_UCP option is set, the behaviour is
565    changed so that Unicode properties are used to determine character types, as
566    follows:
567  .sp  .sp
568    \ed  any character that matches \ep{Nd} (decimal digit)    \ed  any character that matches \ep{Nd} (decimal digit)
569    \es  any character that matches \ep{Z} or \eh or \ev    \es  any character that matches \ep{Z} or \eh or \ev
# Line 555  is noticeably slower when PCRE_UCP is se Line 577  is noticeably slower when PCRE_UCP is se
577  .P  .P
578  The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at  The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at
579  release 5.10. In contrast to the other sequences, which match only ASCII  release 5.10. In contrast to the other sequences, which match only ASCII
580  characters by default, these always match certain high-valued codepoints,  characters by default, these always match certain high-valued code points,
581  whether or not PCRE_UCP is set. The horizontal space characters are:  whether or not PCRE_UCP is set. The horizontal space characters are:
582  .sp  .sp
583    U+0009     Horizontal tab (HT)    U+0009     Horizontal tab (HT)
# Line 1235  The minus (hyphen) character can be used Line 1257  The minus (hyphen) character can be used
1257  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
1258  inclusive. If a minus character is required in a class, it must be escaped with  inclusive. If a minus character is required in a class, it must be escaped with
1259  a backslash or appear in a position where it cannot be interpreted as  a backslash or appear in a position where it cannot be interpreted as
1260  indicating a range, typically as the first or last character in the class.  indicating a range, typically as the first or last character in the class, or
1261    immediately after a range. For example, [b-d-z] matches letters in the range b
1262    to d, a hyphen character, or z.
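
The added [b-d-z] example can be checked with a short sketch. It uses Python's standard `re` module, which appears to parse this particular class the same way but is not PCRE; the helper name `in_class` is made up for illustration.

```python
import re

# [b-d-z]: the range b-d, then a literal hyphen (it follows a range), then z.
# Python's re is used as a stand-in here; it is a different engine from PCRE.
cls = re.compile(r"[b-d-z]")


def in_class(ch: str) -> bool:
    """Return True if the single character ch is matched by [b-d-z]."""
    return cls.fullmatch(ch) is not None
```

Under this reading, "b", "c", "d", "-", and "z" are members of the class, while "e" and "y" are not.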
1263  .P  .P
1264  It is not possible to have the literal character "]" as the end character of a  It is not possible to have the literal character "]" as the end character of a
1265  range. A pattern such as [W-]46] is interpreted as a class of two characters  range. A pattern such as [W-]46] is interpreted as a class of two characters
# Line 1245  the end of range, so [W-\e]46] is interp Line 1269  the end of range, so [W-\e]46] is interp
1269  followed by two other characters. The octal or hexadecimal representation of  followed by two other characters. The octal or hexadecimal representation of
1270  "]" can also be used to end a range.  "]" can also be used to end a range.
1271  .P  .P
1272    An error is generated if a POSIX character class (see below) or an escape
1273    sequence other than one that defines a single character appears at a point
1274    where a range ending character is expected. For example, [z-\exff] is valid,
1275    but [A-\ed] and [A-[:digit:]] are not.
1276    .P
1277  Ranges operate in the collating sequence of character values. They can also be  Ranges operate in the collating sequence of character values. They can also be
1278  used for characters specified numerically, for example [\e000-\e037]. Ranges  used for characters specified numerically, for example [\e000-\e037]. Ranges
1279  can include any characters that are valid for the current mode.  can include any characters that are valid for the current mode.
# Line 1315  are: Line 1344  are:
1344    word     "word" characters (same as \ew)    word     "word" characters (same as \ew)
1345    xdigit   hexadecimal digits    xdigit   hexadecimal digits
1346  .sp  .sp
1347  The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and  The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1348  space (32). "Space" used to be different to \es, which did not include VT, for  and space (32). If locale-specific matching is taking place, there may be
1349  Perl compatibility. However, Perl changed at release 5.18, and PCRE followed at  additional space characters. "Space" used to be different to \es, which did not
1350  release 8.34. "Space" and \es now match the same set of characters.  include VT, for Perl compatibility. However, Perl changed at release 5.18, and
1351    PCRE followed at release 8.34. "Space" and \es now match the same set of
1352    characters.
1353  .P  .P
1354  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1355  5.8. Another Perl extension is negation, which is indicated by a ^ character  5.8. Another Perl extension is negation, which is indicated by a ^ character
# Line 1330  matches "1", "2", or any non-digit. PCRE Line 1361  matches "1", "2", or any non-digit. PCRE
1361  syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not  syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1362  supported, and an error is given if they are encountered.  supported, and an error is given if they are encountered.
1363  .P  .P
1364  By default, in UTF modes, characters with values greater than 128 do not match  By default, characters with values greater than 128 do not match any of the
1365  any of the POSIX character classes. However, if the PCRE_UCP option is passed  POSIX character classes. However, if the PCRE_UCP option is passed to
1366  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode  \fBpcre_compile()\fP, some of the classes are changed so that Unicode character
1367  character properties are used. This is achieved by replacing certain POSIX  properties are used. This is achieved by replacing certain POSIX classes by
1368  classes by other sequences, as follows:  other sequences, as follows:
1369  .sp  .sp
1370    [:alnum:]  becomes  \ep{Xan}    [:alnum:]  becomes  \ep{Xan}
1371    [:alpha:]  becomes  \ep{L}    [:alpha:]  becomes  \ep{L}
# Line 1345  classes by other sequences, as follows: Line 1376  classes by other sequences, as follows:
1376    [:upper:]  becomes  \ep{Lu}    [:upper:]  becomes  \ep{Lu}
1377    [:word:]   becomes  \ep{Xwd}    [:word:]   becomes  \ep{Xwd}
1378  .sp  .sp
1379  Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX  Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX
1380  classes are handled specially in UCP mode:  classes are handled specially in UCP mode:
1381  .TP 10  .TP 10
1382  [:graph:]  [:graph:]
1383  This matches characters that have glyphs that mark the page when printed. In  This matches characters that have glyphs that mark the page when printed. In
1384  Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf  Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1385  properties, except for:  properties, except for:
1386  .sp  .sp
1387    U+061C           Arabic Letter Mark    U+061C           Arabic Letter Mark
1388    U+180E           Mongolian Vowel Separator    U+180E           Mongolian Vowel Separator
1389    U+2066 - U+2069  Various "isolate"s    U+2066 - U+2069  Various "isolate"s
1390  .sp  .sp
1391  .TP 10  .TP 10
1392  [:print:]  [:print:]
1393  This matches the same characters as [:graph:] plus space characters that are  This matches the same characters as [:graph:] plus space characters that are
1394  not controls, that is, characters with the Zs property.  not controls, that is, characters with the Zs property.
1395  .TP 10  .TP 10
1396  [:punct:]  [:punct:]
# Line 1588  conditions, Line 1619  conditions,
1619  .\"  .\"
1620  can be made by name as well as by number.  can be made by name as well as by number.
1621  .P  .P
1622  Names consist of up to 32 alphanumeric characters and underscores. Named  Names consist of up to 32 alphanumeric characters and underscores, but must
1623  capturing parentheses are still allocated numbers as well as names, exactly as  start with a non-digit. Named capturing parentheses are still allocated numbers
1624  if the names were not present. The PCRE API provides function calls for  as well as names, exactly as if the names were not present. The PCRE API
1625  extracting the name-to-number translation table from a compiled pattern. There  provides function calls for extracting the name-to-number translation table
1626  is also a convenience function for extracting a captured substring by name.  from a compiled pattern. There is also a convenience function for extracting a
1627    captured substring by name.
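
The revised naming rule (up to 32 alphanumeric characters and underscores, starting with a non-digit) can be expressed as a small check. This is a sketch only: the function name `is_valid_group_name` is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption of the example, not something the text states.

```python
import re

# Up to 32 alphanumerics/underscores, first character not a digit.
# ASCII-only matching is an assumption of this sketch.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_group_name(name: str) -> bool:
    """Return True if name satisfies the subpattern naming rule sketched above."""
    return _NAME_RE.fullmatch(name) is not None
```

A name such as "n" or "_1" passes, while "1abc" fails because it starts with a digit, and a 33-character name fails on length.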
1628  .P  .P
1629  By default, a name must be unique within a pattern, but it is possible to relax  By default, a name must be unique within a pattern, but it is possible to relax
1630  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
# Line 1618  for the first (and in this example, the Line 1650  for the first (and in this example, the
1650  matched. This saves searching to find which numbered subpattern it was.  matched. This saves searching to find which numbered subpattern it was.
1651  .P  .P
1652  If you make a back reference to a non-unique named subpattern from elsewhere in  If you make a back reference to a non-unique named subpattern from elsewhere in
1653  the pattern, the subpatterns to which the name refers are checked in the order  the pattern, the subpatterns to which the name refers are checked in the order
1654  in which they appear in the overall pattern. The first one that is set is used  in which they appear in the overall pattern. The first one that is set is used
1655  for the reference. For example, this pattern matches both "foofoo" and  for the reference. For example, this pattern matches both "foofoo" and
1656  "barbar" but not "foobar" or "barfoo":  "barbar" but not "foobar" or "barfoo":
1657  .sp  .sp
1658    (?:(?<n>foo)|(?<n>bar))\k<n>    (?:(?<n>foo)|(?<n>bar))\ek<n>
1659  .sp  .sp
1660  .P  .P
1661  If you make a subroutine call to a non-unique named subpattern, the one that  If you make a subroutine call to a non-unique named subpattern, the one that
# Line 2324  This makes the fragment independent of t Line 2356  This makes the fragment independent of t
2356  .sp  .sp
2357  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2358  subpattern by name. For compatibility with earlier versions of PCRE, which had  subpattern by name. For compatibility with earlier versions of PCRE, which had
2359  this facility before Perl, the syntax (?(name)...) is also recognized. However,  this facility before Perl, the syntax (?(name)...) is also recognized.
 there is a possible ambiguity with this syntax, because subpattern names may  
 consist entirely of digits. PCRE looks first for a named subpattern; if it  
 cannot find one and the name consists entirely of digits, PCRE looks for a  
 subpattern of that number, which must be greater than zero. Using subpattern  
 names that consist entirely of digits is not recommended.  
2360  .P  .P
2361  Rewriting the above example to use a named subpattern gives this:  Rewriting the above example to use a named subpattern gives this:
2362  .sp  .sp
# Line 2751  During matching, when PCRE reaches a cal Line 2778  During matching, when PCRE reaches a cal
2778  called. It is provided with the number of the callout, the position in the  called. It is provided with the number of the callout, the position in the
2779  pattern, and, optionally, one item of data originally supplied by the caller of  pattern, and, optionally, one item of data originally supplied by the caller of
2780  the matching function. The callout function may cause matching to proceed, to  the matching function. The callout function may cause matching to proceed, to
2781  backtrack, or to fail altogether. A complete description of the interface to  backtrack, or to fail altogether.
2782  the callout function is given in the  .P
2783    By default, PCRE implements a number of optimizations at compile time and
2784    matching time, and one side-effect is that sometimes callouts are skipped. If
2785    you need all possible callouts to happen, you need to set options that disable
2786    the relevant optimizations. More details, and a complete description of the
2787    interface to the callout function, are given in the
2788  .\" HREF  .\" HREF
2789  \fBpcrecallout\fP  \fBpcrecallout\fP
2790  .\"  .\"
# Line 3198  Cambridge CB2 3QH, England. Line 3230  Cambridge CB2 3QH, England.
3230  .rs  .rs
3231  .sp  .sp
3232  .nf  .nf
3233  Last updated: 05 November 2013  Last updated: 12 November 2013
3234  Copyright (c) 1997-2013 University of Cambridge.  Copyright (c) 1997-2013 University of Cambridge.
3235  .fi  .fi
