Diff of /code/trunk/doc/pcrepattern.3

revision 1361 by ph10, Fri Sep 6 17:47:32 2013 UTC revision 1394 by ph10, Sat Nov 9 09:17:20 2013 UTC
# Line 1  Line 1 
1  .TH PCREPATTERN 3 "06 September 2013" "PCRE 8.34"  .TH PCREPATTERN 3 "08 November 2013" "PCRE 8.34"
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PCRE REGULAR EXPRESSION DETAILS"  .SH "PCRE REGULAR EXPRESSION DETAILS"
# Line 164  pattern of the form Line 164  pattern of the form
164    (*LIMIT_RECURSION=d)    (*LIMIT_RECURSION=d)
165  .sp  .sp
166  where d is any number of decimal digits. However, the value of the setting must  where d is any number of decimal digits. However, the value of the setting must
167  be less than the value set by the caller of \fBpcre_exec()\fP for it to have  be less than the value set (or defaulted) by the caller of \fBpcre_exec()\fP
168  any effect. In other words, the pattern writer can lower the limit set by the  for it to have any effect. In other words, the pattern writer can lower the
169  programmer, but not raise it. If there is more than one setting of one of these  limits set by the programmer, but not raise them. If there is more than one
170  limits, the lower value is used.  setting of one of these limits, the lower value is used.
171  .  .
172  .  .
173  .SH "EBCDIC CHARACTER CODES"  .SH "EBCDIC CHARACTER CODES"
# Line 300  one of the following escape sequences th Line 300  one of the following escape sequences th
300    \en        linefeed (hex 0A)    \en        linefeed (hex 0A)
301    \er        carriage return (hex 0D)    \er        carriage return (hex 0D)
302    \et        tab (hex 09)    \et        tab (hex 09)
303      \e0dd      character with octal code 0dd
304    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
305      \eo{ddd..} character with octal code ddd..
306    \exhh      character with hex code hh    \exhh      character with hex code hh
307    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
308    \euhhhh    character with hex code hhhh (JavaScript mode only)    \euhhhh    character with hex code hhhh (JavaScript mode only)
# Line 321  byte are inverted. Thus \ecA becomes hex Line 323  byte are inverted. Thus \ecA becomes hex
323  the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other  the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
324  characters also generate different values.  characters also generate different values.
325  .P  .P
 By default, after \ex, from zero to two hexadecimal digits are read (letters  
 can be in upper or lower case). Any number of hexadecimal digits may appear  
 between \ex{ and }, but the character code is constrained as follows:  
 .sp  
   8-bit non-UTF mode    less than 0x100  
   8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint  
   16-bit non-UTF mode   less than 0x10000  
   16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint  
   32-bit non-UTF mode   less than 0x80000000  
   32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint  
 .sp  
 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  
 "surrogate" codepoints), and 0xffef.  
 .P  
 If characters other than hexadecimal digits appear between \ex{ and }, or if  
 there is no terminating }, this form of escape is not recognized. Instead, the  
 initial \ex will be interpreted as a basic hexadecimal escape, with no  
 following digits, giving a character whose value is zero.  
 .P  
 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is  
 as just described only when it is followed by two hexadecimal digits.  
 Otherwise, it matches a literal "x" character. In JavaScript mode, support for  
 code points greater than 256 is provided by \eu, which must be followed by  
 four hexadecimal digits; otherwise it matches a literal "u" character.  
 Character codes specified by \eu in JavaScript mode are constrained in the same  
 was as those specified by \ex in non-JavaScript mode.  
 .P  
 Characters whose value is less than 256 can be defined by either of the two  
 syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the  
 way they are handled. For example, \exdc is exactly the same as \ex{dc} (or  
 \eu00dc in JavaScript mode).  
 .P  
326  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
327  digits, just those that are present are used. Thus the sequence \e0\ex\e07  digits, just those that are present are used. Thus the sequence \e0\ex\e07
328  specifies two binary zeros followed by a BEL character (code value 7). Make  specifies two binary zeros followed by a BEL character (code value 7). Make
329  sure you supply two digits after the initial zero if the pattern character that  sure you supply two digits after the initial zero if the pattern character that
330  follows is itself an octal digit.  follows is itself an octal digit.
331  .P  .P
332  The handling of a backslash followed by a digit other than 0 is complicated.  The escape \eo must be followed by a sequence of octal digits, enclosed in
333  Outside a character class, PCRE reads it and any following digits as a decimal  braces. An error occurs if this is not the case. This escape is a recent
334  number. If the number is less than 10, or if there have been at least that many  addition to Perl; it provides way of specifying character code points as octal
335    numbers greater than 0777, and it also allows octal numbers and back references
336    to be unambiguously specified.
337    .P
338    For greater clarity and unambiguity, it is best to avoid following \e by a
339    digit greater than zero. Instead, use \eo{} or \ex{} to specify character
340    numbers, and \eg{} to specify back references. The following paragraphs
341    describe the old, ambiguous syntax.
342    .P
343    The handling of a backslash followed by a digit other than 0 is complicated,
344    and Perl has changed in recent releases, causing PCRE also to change. Outside a
345    character class, PCRE reads the digit and any following digits as a decimal
346    number. If the number is less than 8, or if there have been at least that many
347  previous capturing left parentheses in the expression, the entire sequence is  previous capturing left parentheses in the expression, the entire sequence is
348  taken as a \fIback reference\fP. A description of how this works is given  taken as a \fIback reference\fP. A description of how this works is given
349  .\" HTML <a href="#backreferences">  .\" HTML <a href="#backreferences">
# Line 374  following the discussion of Line 356  following the discussion of
356  parenthesized subpatterns.  parenthesized subpatterns.
357  .\"  .\"
358  .P  .P
359  Inside a character class, or if the decimal number is greater than 9 and there  Inside a character class, or if the decimal number following \e is greater than
360  have not been that many capturing subpatterns, PCRE re-reads up to three octal  7 and there have not been that many capturing subpatterns, PCRE handles \e8 and
361  digits following the backslash, and uses them to generate a data character. Any  \e9 as the literal characters "8" and "9", and otherwise re-reads up to three
362  subsequent digits stand for themselves. The value of the character is  octal digits following the backslash, using them to generate a data character.
363  constrained in the same way as characters specified in hexadecimal.  Any subsequent digits stand for themselves. For example:
 For example:  
364  .sp  .sp
365    \e040   is another way of writing an ASCII space    \e040   is another way of writing an ASCII space
366  .\" JOIN  .\" JOIN
# Line 398  For example: Line 379  For example:
379    \e377   might be a back reference, otherwise    \e377   might be a back reference, otherwise
380              the value 255 (decimal)              the value 255 (decimal)
381  .\" JOIN  .\" JOIN
382    \e81    is either a back reference, or a binary zero    \e81    is either a back reference, or the two
383              followed by the two characters "8" and "1"              characters "8" and "1"
384  .sp  .sp
385  Note that octal values of 100 or greater must not be introduced by a leading  Note that octal values of 100 or greater that are specified using this syntax
386  zero, because no more than three octal digits are ever read.  must not be introduced by a leading zero, because no more than three octal
387    digits are ever read.
388    .P
389    By default, after \ex that is not followed by {, from zero to two hexadecimal
390    digits are read (letters can be in upper or lower case). Any number of
391    hexadecimal digits may appear between \ex{ and }. If a character other than
392    a hexadecimal digit appears between \ex{ and }, or if there is no terminating
393    }, an error occurs.
394  .P  .P
395    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
396    as just described only when it is followed by two hexadecimal digits.
397    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
398    code points greater than 256 is provided by \eu, which must be followed by
399    four hexadecimal digits; otherwise it matches a literal "u" character.
400    .P
401    Characters whose value is less than 256 can be defined by either of the two
402    syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
403    way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
404    \eu00dc in JavaScript mode).
405    .
406    .
407    .SS "Constraints on character values"
408    .rs
409    .sp
410    Characters that are specified using octal or hexadecimal numbers are
411    limited to certain values, as follows:
412    .sp
413      8-bit non-UTF mode    less than 0x100
414      8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
415      16-bit non-UTF mode   less than 0x10000
416      16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
417      32-bit non-UTF mode   less than 0x100000000
418      32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
419    .sp
420    Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
421    "surrogate" codepoints), and 0xffef.
422    .
423    .
424    .SS "Escape sequences in character classes"
425    .rs
426    .sp
427  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
428  and outside character classes. In addition, inside a character class, \eb is  and outside character classes. In addition, inside a character class, \eb is
429  interpreted as the backspace character (hex 08).  interpreted as the backspace character (hex 08).
# Line 494  classes. They each match one character o Line 514  classes. They each match one character o
514  matching point is at the end of the subject string, all of them fail, because  matching point is at the end of the subject string, all of them fail, because
515  there is no character to match.  there is no character to match.
516  .P  .P
517  For compatibility with Perl, \es does not match the VT character (code 11).  For compatibility with Perl, \es did not used to match the VT character (code
518  This makes it different from the the POSIX "space" class. The \es characters  11), which made it different from the the POSIX "space" class. However, Perl
519  are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is  added VT at release 5.18, and PCRE followed suit at release 8.34. The \es
520  included in a Perl script, \es may match the VT character. In PCRE, it never  characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space (32).
 does.  
521  .P  .P
522  A "word" character is an underscore or any character that is a letter or digit.  A "word" character is an underscore or any character that is a letter or digit.
523  By default, the definition of letters and digits is controlled by PCRE's  By default, the definition of letters and digits is controlled by PCRE's
# Line 524  efficiency reasons. However, if PCRE is Line 543  efficiency reasons. However, if PCRE is
543  and the PCRE_UCP option is set, the behaviour is changed so that Unicode  and the PCRE_UCP option is set, the behaviour is changed so that Unicode
544  properties are used to determine character types, as follows:  properties are used to determine character types, as follows:
545  .sp  .sp
546    \ed  any character that \ep{Nd} matches (decimal digit)    \ed  any character that matches \ep{Nd} (decimal digit)
547    \es  any character that \ep{Z} matches, plus HT, LF, FF, CR    \es  any character that matches \ep{Z} or \eh or \ev
548    \ew  any character that \ep{L} or \ep{N} matches, plus underscore    \ew  any character that matches \ep{L} or \ep{N}, plus underscore
549  .sp  .sp
550  The upper case escapes match the inverse sets of characters. Note that \ed  The upper case escapes match the inverse sets of characters. Note that \ed
551  matches only decimal digits, whereas \ew matches any Unicode digit, as well as  matches only decimal digits, whereas \ew matches any Unicode digit, as well as
# Line 906  the "mark" property always have the "ext Line 925  the "mark" property always have the "ext
925  .sp  .sp
926  As well as the standard Unicode properties described above, PCRE supports four  As well as the standard Unicode properties described above, PCRE supports four
927  more that make it possible to convert traditional escape sequences such as \ew  more that make it possible to convert traditional escape sequences such as \ew
928  and \es and POSIX character classes to use Unicode properties. PCRE uses these  and \es to use Unicode properties. PCRE uses these non-standard, non-Perl
929  non-standard, non-Perl properties internally when PCRE_UCP is set. However,  properties internally when PCRE_UCP is set. However, they may also be used
930  they may also be used explicitly. These properties are:  explicitly. These properties are:
931  .sp  .sp
932    Xan   Any alphanumeric character    Xan   Any alphanumeric character
933    Xps   Any POSIX space character    Xps   Any POSIX space character
# Line 918  they may also be used explicitly. These Line 937  they may also be used explicitly. These
937  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
938  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
939  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
940  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps; it used to exclude vertical tab, for Perl
941  same characters as Xan, plus underscore.  compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
942    matches the same characters as Xan, plus underscore.
943  .P  .P
944  There is another non-standard property, Xuc, which matches any character that  There is another non-standard property, Xuc, which matches any character that
945  can be represented by a Universal Character Name in C++ and other programming  can be represented by a Universal Character Name in C++ and other programming
# Line 1215  The minus (hyphen) character can be used Line 1235  The minus (hyphen) character can be used
1235  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
1236  inclusive. If a minus character is required in a class, it must be escaped with  inclusive. If a minus character is required in a class, it must be escaped with
1237  a backslash or appear in a position where it cannot be interpreted as  a backslash or appear in a position where it cannot be interpreted as
1238  indicating a range, typically as the first or last character in the class.  indicating a range, typically as the first or last character in the class, or
1239    immediately after a range. For example, [b-d-z] matches letters in the range b
1240    to d, a hyphen character, or z.
1241  .P  .P
1242  It is not possible to have the literal character "]" as the end character of a  It is not possible to have the literal character "]" as the end character of a
1243  range. A pattern such as [W-]46] is interpreted as a class of two characters  range. A pattern such as [W-]46] is interpreted as a class of two characters
# Line 1225  the end of range, so [W-\e]46] is interp Line 1247  the end of range, so [W-\e]46] is interp
1247  followed by two other characters. The octal or hexadecimal representation of  followed by two other characters. The octal or hexadecimal representation of
1248  "]" can also be used to end a range.  "]" can also be used to end a range.
1249  .P  .P
1250    An error is generated if a POSIX character class (see below) or an escape
1251    sequence other than one that defines a single character appears at a point
1252    where a range ending character is expected. For example, [z-\exff] is valid,
1253    but [A-\ed] and [A-[:digit:]] are not.
1254    .P
1255  Ranges operate in the collating sequence of character values. They can also be  Ranges operate in the collating sequence of character values. They can also be
1256  used for characters specified numerically, for example [\e000-\e037]. Ranges  used for characters specified numerically, for example [\e000-\e037]. Ranges
1257  can include any characters that are valid for the current mode.  can include any characters that are valid for the current mode.
# Line 1290  are: Line 1317  are:
1317    lower    lower case letters    lower    lower case letters
1318    print    printing characters, including space    print    printing characters, including space
1319    punct    printing characters, excluding letters and digits and space    punct    printing characters, excluding letters and digits and space
1320    space    white space (not quite the same as \es)    space    white space (the same as \es from PCRE 8.34)
1321    upper    upper case letters    upper    upper case letters
1322    word     "word" characters (same as \ew)    word     "word" characters (same as \ew)
1323    xdigit   hexadecimal digits    xdigit   hexadecimal digits
1324  .sp  .sp
1325  The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and  The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
1326  space (32). Notice that this list includes the VT character (code 11). This  space (32). "Space" used to be different to \es, which did not include VT, for
1327  makes "space" different to \es, which does not include VT (for Perl  Perl compatibility. However, Perl changed at release 5.18, and PCRE followed at
1328  compatibility).  release 8.34. "Space" and \es now match the same set of characters.
1329  .P  .P
1330  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1331  5.8. Another Perl extension is negation, which is indicated by a ^ character  5.8. Another Perl extension is negation, which is indicated by a ^ character
# Line 1313  supported, and an error is given if they Line 1340  supported, and an error is given if they
1340  By default, in UTF modes, characters with values greater than 128 do not match  By default, in UTF modes, characters with values greater than 128 do not match
1341  any of the POSIX character classes. However, if the PCRE_UCP option is passed  any of the POSIX character classes. However, if the PCRE_UCP option is passed
1342  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
1343  character properties are used. This is achieved by replacing the POSIX classes  character properties are used. This is achieved by replacing certain POSIX
1344  by other sequences, as follows:  classes by other sequences, as follows:
1345  .sp  .sp
1346    [:alnum:]  becomes  \ep{Xan}    [:alnum:]  becomes  \ep{Xan}
1347    [:alpha:]  becomes  \ep{L}    [:alpha:]  becomes  \ep{L}
# Line 1325  by other sequences, as follows: Line 1352  by other sequences, as follows:
1352    [:upper:]  becomes  \ep{Lu}    [:upper:]  becomes  \ep{Lu}
1353    [:word:]   becomes  \ep{Xwd}    [:word:]   becomes  \ep{Xwd}
1354  .sp  .sp
1355  Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX  Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX
1356  classes are unchanged, and match only characters with code points less than  classes are handled specially in UCP mode:
1357  128.  .TP 10
1358    [:graph:]
1359    This matches characters that have glyphs that mark the page when printed. In
1360    Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1361    properties, except for:
1362    .sp
1363      U+061C           Arabic Letter Mark
1364      U+180E           Mongolian Vowel Separator
1365      U+2066 - U+2069  Various "isolate"s
1366    .sp
1367    .TP 10
1368    [:print:]
1369    This matches the same characters as [:graph:] plus space characters that are
1370    not controls, that is, characters with the Zs property.
1371    .TP 10
1372    [:punct:]
1373    This matches all characters that have the Unicode P (punctuation) property,
1374    plus those characters whose code points are less than 128 that have the S
1375    (Symbol) property.
1376    .P
1377    The other POSIX classes are unchanged, and match only characters with code
1378    points less than 128.
1379  .  .
1380  .  .
1381  .SH "VERTICAL BAR"  .SH "VERTICAL BAR"
# Line 1547  conditions, Line 1595  conditions,
1595  .\"  .\"
1596  can be made by name as well as by number.  can be made by name as well as by number.
1597  .P  .P
1598  Names consist of up to 32 alphanumeric characters and underscores. Named  Names consist of up to 32 alphanumeric characters and underscores, but must
1599  capturing parentheses are still allocated numbers as well as names, exactly as  start with a non-digit. Named capturing parentheses are still allocated numbers
1600  if the names were not present. The PCRE API provides function calls for  as well as names, exactly as if the names were not present. The PCRE API
1601  extracting the name-to-number translation table from a compiled pattern. There  provides function calls for extracting the name-to-number translation table
1602  is also a convenience function for extracting a captured substring by name.  from a compiled pattern. There is also a convenience function for extracting a
1603    captured substring by name.
1604  .P  .P
1605  By default, a name must be unique within a pattern, but it is possible to relax  By default, a name must be unique within a pattern, but it is possible to relax
1606  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
# Line 2283  This makes the fragment independent of t Line 2332  This makes the fragment independent of t
2332  .sp  .sp
2333  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2334  subpattern by name. For compatibility with earlier versions of PCRE, which had  subpattern by name. For compatibility with earlier versions of PCRE, which had
2335  this facility before Perl, the syntax (?(name)...) is also recognized. However,  this facility before Perl, the syntax (?(name)...) is also recognized.
 there is a possible ambiguity with this syntax, because subpattern names may  
 consist entirely of digits. PCRE looks first for a named subpattern; if it  
 cannot find one and the name consists entirely of digits, PCRE looks for a  
 subpattern of that number, which must be greater than zero. Using subpattern  
 names that consist entirely of digits is not recommended.  
2336  .P  .P
2337  Rewriting the above example to use a named subpattern gives this:  Rewriting the above example to use a named subpattern gives this:
2338  .sp  .sp
# Line 3157  Cambridge CB2 3QH, England. Line 3201  Cambridge CB2 3QH, England.
3201  .rs  .rs
3202  .sp  .sp
3203  .nf  .nf
3204  Last updated: 06 September 2013  Last updated: 08 November 2013
3205  Copyright (c) 1997-2013 University of Cambridge.  Copyright (c) 1997-2013 University of Cambridge.
3206  .fi  .fi
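
Reviewer note: the revised backslash-digit rule added in r1394 (new-file lines 343-363) is easy to misread in diff form, so here is a minimal sketch of it in Python. It is an illustration of the prose only, not PCRE's actual implementation. The function name, its parameters, and the ("backref" / "literal") return tags are hypothetical. The sketch assumes the first digit after the backslash is 1-9 (the \0 case is handled by a separate paragraph) and it ignores the per-mode limits on character values.

```python
OCTAL_DIGITS = "01234567"
DECIMAL_DIGITS = "0123456789"


def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify a backslash followed by `digits`, per the r1394 wording.

    digits        -- the run of decimal digits after the backslash (hypothetical input form)
    groups_so_far -- number of capturing left parentheses seen before this point
    in_class      -- True when the escape appears inside a character class
    """
    if not digits or digits[0] == "0" or any(c not in DECIMAL_DIGITS for c in digits):
        raise ValueError("expected decimal digits starting with 1-9")

    if not in_class:
        number = int(digits)
        # Less than 8, or at least that many groups so far: back reference.
        if number < 8 or number <= groups_so_far:
            return ("backref", number)

    # Not a back reference: \8 and \9 are literal "8" and "9",
    # and any following digits stand for themselves.
    if digits[0] in "89":
        return ("literal", digits)

    # Otherwise re-read up to three octal digits as one data character;
    # any remaining digits stand for themselves.
    end = 0
    while end < min(3, len(digits)) and digits[end] in OCTAL_DIGITS:
        end += 1
    return ("literal", chr(int(digits[:end], 8)) + digits[end:])
```

Under these assumptions, `interpret_backslash_digits("81", 0)` yields the two characters "8" and "1", matching the changed example for \e81, while the same call with 81 preceding groups yields a back reference.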
