
Diff of /code/trunk/doc/pcrepattern.3


revision 211 by ph10, Thu Aug 9 09:52:43 2007 UTC revision 394 by ph10, Wed Mar 18 16:38:23 2009 UTC
# Line 9  are described in detail below. There is Line 9  are described in detail below. There is
9  .\" HREF  .\" HREF
10  \fBpcresyntax\fP  \fBpcresyntax\fP
11  .\"  .\"
12  page. Perl's regular expressions are described in its own documentation, and  page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
13    also supports some alternative regular expression syntax (which does not
14    conflict with the Perl syntax) in order to provide some compatibility with
15    regular expressions in Python, .NET, and Oniguruma.
16    .P
17    Perl's regular expressions are described in its own documentation, and
18  regular expressions in general are covered in a number of books, some of which  regular expressions in general are covered in a number of books, some of which
19  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
20  published by O'Reilly, covers regular expressions in great detail. This  published by O'Reilly, covers regular expressions in great detail. This
# Line 44  discussed in the Line 49  discussed in the
49  page.  page.
50  .  .
51  .  .
52    .SH "NEWLINE CONVENTIONS"
53    .rs
54    .sp
55    PCRE supports five different conventions for indicating line breaks in
56    strings: a single CR (carriage return) character, a single LF (linefeed)
57    character, the two-character sequence CRLF, any of the three preceding, or any
58    Unicode newline sequence. The
59    .\" HREF
60    \fBpcreapi\fP
61    .\"
62    page has
63    .\" HTML <a href="pcreapi.html#newlines">
64    .\" </a>
65    further discussion
66    .\"
67    about newlines, and shows how to set the newline convention in the
68    \fIoptions\fP arguments for the compiling and matching functions.
69    .P
70    It is also possible to specify a newline convention by starting a pattern
71    string with one of the following five sequences:
72    .sp
73      (*CR)        carriage return
74      (*LF)        linefeed
75      (*CRLF)      carriage return, followed by linefeed
76      (*ANYCRLF)   any of the three above
77      (*ANY)       all Unicode newline sequences
78    .sp
79    These override the default and the options given to \fBpcre_compile()\fP. For
80    example, on a Unix system where LF is the default newline sequence, the pattern
81    .sp
82      (*CR)a.b
83    .sp
84    changes the convention to CR. That pattern matches "a\enb" because LF is no
85    longer a newline. Note that these special settings, which are not
86    Perl-compatible, are recognized only at the very start of a pattern, and that
87    they must be in upper case. If more than one of them is present, the last one
88    is used.
89    .P
90    The newline convention does not affect what the \eR escape sequence matches. By
91    default, this is any Unicode newline sequence, for Perl compatibility. However,
92    this can be changed; see the description of \eR in the section entitled
93    .\" HTML <a href="#newlineseq">
94    .\" </a>
95    "Newline sequences"
96    .\"
97    below. A change of \eR setting can be combined with a change of newline
98    convention.
99    .
100    .
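
The effect of the (*CR) example above can be approximated outside PCRE. The sketch below uses Python's standard `re` module, which is not PCRE and does not recognise (*CR). It assumes that under the CR convention the dot excludes only CR, and stands in for that with the class `[^\r]`. The names `lf_style` and `cr_style` are invented for the illustration.

```python
import re

# Python's "." excludes only LF, which mirrors PCRE's dot when LF is the
# newline convention.
lf_style = re.compile(r"a.b")

# Assumption: under (*CR) the dot excludes CR instead. Python has no (*CR),
# so the class [^\r] is used here as a stand-in for the dot.
cr_style = re.compile(r"a[^\r]b")

subject = "a\nb"
lf_result = lf_style.search(subject)  # no match: LF counts as a newline
cr_result = cr_style.search(subject)  # match: LF is an ordinary character
```

With real PCRE the same comparison would be made by compiling `a.b` and `(*CR)a.b`; the emulation only shows why the second pattern accepts "a\nb".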
101  .SH "CHARACTERS AND METACHARACTERS"  .SH "CHARACTERS AND METACHARACTERS"
102  .rs  .rs
103  .sp  .sp
# Line 153  represents: Line 207  represents:
207    \ecx       "control-x", where x is any character    \ecx       "control-x", where x is any character
208    \ee        escape (hex 1B)    \ee        escape (hex 1B)
209    \ef        formfeed (hex 0C)    \ef        formfeed (hex 0C)
210    \en        newline (hex 0A)    \en        linefeed (hex 0A)
211    \er        carriage return (hex 0D)    \er        carriage return (hex 0D)
212    \et        tab (hex 09)    \et        tab (hex 09)
213    \eddd      character with octal code ddd, or backreference    \eddd      character with octal code ddd, or backreference
# Line 261  parenthesized subpatterns. Line 315  parenthesized subpatterns.
315  .\"  .\"
316  .  .
317  .  .
318    .SS "Absolute and relative subroutine calls"
319    .rs
320    .sp
321    For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
322    a number enclosed either in angle brackets or single quotes is an alternative
323    syntax for referencing a subpattern as a "subroutine". Details are discussed
324    .\" HTML <a href="#onigurumasubroutines">
325    .\" </a>
326    later.
327    .\"
328    Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
329    synonymous. The former is a back reference; the latter is a subroutine call.
330    .
331    .
332  .SS "Generic character types"  .SS "Generic character types"
333  .rs  .rs
334  .sp  .sp
# Line 296  In UTF-8 mode, characters with values gr Line 364  In UTF-8 mode, characters with values gr
364  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode
365  character property support is available. These sequences retain their original  character property support is available. These sequences retain their original
366  meanings from before UTF-8 support was available, mainly for efficiency  meanings from before UTF-8 support was available, mainly for efficiency
367  reasons.  reasons. Note that this also affects \eb, because it is defined in terms of \ew
368    and \eW.
369  .P  .P
370  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
371  other sequences, these do match certain high-valued codepoints in UTF-8 mode.  other sequences, these do match certain high-valued codepoints in UTF-8 mode.
# Line 350  accented letters, and these are matched Line 419  accented letters, and these are matched
419  is discouraged.  is discouraged.
420  .  .
421  .  .
422    .\" HTML <a name="newlineseq"></a>
423  .SS "Newline sequences"  .SS "Newline sequences"
424  .rs  .rs
425  .sp  .sp
426  Outside a character class, the escape sequence \eR matches any Unicode newline  Outside a character class, by default, the escape sequence \eR matches any
427  sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is equivalent to  Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is
428  the following:  equivalent to the following:
429  .sp  .sp
430    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
431  .sp  .sp
# Line 375  are added: LS (line separator, U+2028) a Line 445  are added: LS (line separator, U+2028) a
445  Unicode character property support is not needed for these characters to be  Unicode character property support is not needed for these characters to be
446  recognized.  recognized.
447  .P  .P
448    It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
449    complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
450    either at compile time or when the pattern is matched. (BSR is an abbreviation
451    for "backslash R".) This can be made the default when PCRE is built; if this is
452    the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
453    It is also possible to specify these settings by starting a pattern string with
454    one of the following sequences:
455    .sp
456      (*BSR_ANYCRLF)   CR, LF, or CRLF only
457      (*BSR_UNICODE)   any Unicode newline sequence
458    .sp
459    These override the default and the options given to \fBpcre_compile()\fP, but
460    they can be overridden by options given to \fBpcre_exec()\fP. Note that these
461    special settings, which are not Perl-compatible, are recognized only at the
462    very start of a pattern, and that they must be in upper case. If more than one
463    of them is present, the last one is used. They can be combined with a change of
464    newline convention; for example, a pattern can start with:
465    .sp
466      (*ANY)(*BSR_ANYCRLF)
467    .sp
468  Inside a character class, \eR matches the letter "R".  Inside a character class, \eR matches the letter "R".
469  .  .
470  .  .
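
To make the two \eR settings concrete, here is a minimal sketch in Python's standard `re` module, which is not PCRE and has no \R escape. It spells out the non-UTF-8 alternation listed above and a narrower CR, LF, or CRLF form. The constant names are invented for the example, and a non-capturing group is used where PCRE uses an atomic group, so backtracking behaviour may differ.

```python
import re

# Non-UTF-8 form of \R as listed above, written with a non-capturing group.
# PCRE uses an atomic group (?>...) so that CRLF is never split; the group
# used here can backtrack, which is a known difference.
BSR_UNICODE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

# Restricted form corresponding to PCRE_BSR_ANYCRLF or (*BSR_ANYCRLF).
BSR_ANYCRLF = re.compile(r"(?:\r\n|\n|\r)")

text = "one\r\ntwo\x0bthree\nfour"
unicode_parts = BSR_UNICODE.split(text)  # VT (hex 0B) counts as a newline
anycrlf_parts = BSR_ANYCRLF.split(text)  # VT is left inside the second part
```

The two results differ only at the vertical tab, which is the kind of character that (*BSR_ANYCRLF) stops \eR from matching.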
# Line 922  alternative in the subpattern. Line 1012  alternative in the subpattern.
1012  .rs  .rs
1013  .sp  .sp
1014  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
1015  PCRE_EXTENDED options can be changed from within the pattern by a sequence of  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
1016  Perl option letters enclosed between "(?" and ")". The option letters are  the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
1017    The option letters are
1018  .sp  .sp
1019    i  for PCRE_CASELESS    i  for PCRE_CASELESS
1020    m  for PCRE_MULTILINE    m  for PCRE_MULTILINE
# Line 937  PCRE_MULTILINE while unsetting PCRE_DOTA Line 1028  PCRE_MULTILINE while unsetting PCRE_DOTA
1028  permitted. If a letter appears both before and after the hyphen, the option is  permitted. If a letter appears both before and after the hyphen, the option is
1029  unset.  unset.
1030  .P  .P
1031    The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
1032    changed in the same way as the Perl-compatible options by using the characters
1033    J, U and X respectively.
1034    .P
1035  When an option change occurs at top level (that is, not inside subpattern  When an option change occurs at top level (that is, not inside subpattern
1036  parentheses), the change applies to the remainder of the pattern that follows.  parentheses), the change applies to the remainder of the pattern that follows.
1037  If the change is placed right at the start of a pattern, PCRE extracts it into  If the change is placed right at the start of a pattern, PCRE extracts it into
# Line 960  branch is abandoned before the option se Line 1055  branch is abandoned before the option se
1055  option settings happen at compile time. There would be some very weird  option settings happen at compile time. There would be some very weird
1056  behaviour otherwise.  behaviour otherwise.
1057  .P  .P
1058  The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be  \fBNote:\fP There are other PCRE-specific options that can be set by the
1059  changed in the same way as the Perl-compatible options by using the characters  application when the compile or match functions are called. In some cases the
1060  J, U and X respectively.  pattern can contain special leading sequences to override what the application
1061    has set or what has been defaulted. Details are given in the section entitled
1062    .\" HTML <a href="#newlineseq">
1063    .\" </a>
1064    "Newline sequences"
1065    .\"
1066    above.
1067  .  .
1068  .  .
1069  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
# Line 1111  details of the interfaces for handling n Line 1212  details of the interfaces for handling n
1212  \fBpcreapi\fP  \fBpcreapi\fP
1213  .\"  .\"
1214  documentation.  documentation.
1215    .P
1216    \fBWarning:\fP You cannot use different names to distinguish between two
1217    subpatterns with the same number (see the previous section) because PCRE uses
1218    only the numbers when matching.
1219  .  .
1220  .  .
1221  .SH REPETITION  .SH REPETITION
# Line 1159  support is available, \eX{3} matches thr Line 1264  support is available, \eX{3} matches thr
1264  which may be several bytes long (and they may be of different lengths).  which may be several bytes long (and they may be of different lengths).
1265  .P  .P
1266  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
1267  previous item and the quantifier were not present.  previous item and the quantifier were not present. This may be useful for
1268    subpatterns that are referenced as
1269    .\" HTML <a href="#subpatternsassubroutines">
1270    .\" </a>
1271    subroutines
1272    .\"
1273    from elsewhere in the pattern. Items other than subpatterns that have a {0}
1274    quantifier are omitted from the compiled pattern.
1275  .P  .P
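
The basic {0} behaviour can be checked with any regex engine that accepts the quantifier. The following sketch uses Python's standard `re` module as a stand-in for PCRE; it shows only the "as if not present" rule, not the subroutine use mentioned above, which Python's `re` does not support. The variable names are illustrative.

```python
import re

# "b{0}" behaves as if neither "b" nor the quantifier were present,
# so this pattern should accept exactly what "ac" accepts.
with_zero = re.compile(r"ab{0}c")
plain = re.compile(r"ac")

zero_matches_ac = with_zero.fullmatch("ac") is not None
zero_matches_abc = with_zero.fullmatch("abc") is not None
plain_matches_ac = plain.fullmatch("ac") is not None
```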
1276  For convenience, the three most common quantifiers have single-character  For convenience, the three most common quantifiers have single-character
1277  abbreviations:  abbreviations:
# Line 1942  It matches "abcabc". It does not match " Line 2054  It matches "abcabc". It does not match "
2054  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
2055  .  .
2056  .  .
2057    .\" HTML <a name="onigurumasubroutines"></a>
2058    .SH "ONIGURUMA SUBROUTINE SYNTAX"
2059    .rs
2060    .sp
2061    For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
2062    a number enclosed either in angle brackets or single quotes, as an alternative
2063    syntax for referencing a subpattern as a subroutine, possibly recursively. Here
2064    are two of the examples used above, rewritten using this syntax:
2065    .sp
2066      (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
2067      (sens|respons)e and \eg'1'ibility
2068    .sp
2069    PCRE supports an extension to Oniguruma: if a number is preceded by a
2070    plus or a minus sign it is taken as a relative reference. For example:
2071    .sp
2072      (abc)(?i:\eg<-1>)
2073    .sp
2074    Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
2075    synonymous. The former is a back reference; the latter is a subroutine call.
2076    .
2077    .
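
The difference between a back reference and a subroutine call can be sketched as follows. Python's standard `re` module is used here; it is not PCRE and has no \eg<...> call syntax inside patterns, so the subroutine call is emulated by repeating the subpattern text by hand. The variable names are invented for the illustration.

```python
import re

# Back reference: \1 must repeat the exact text that group 1 captured.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern itself is matched again, so the
# second occurrence may choose a different alternative. In PCRE this would
# be written with \g<1> or \g'1' instead of repeating the group.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

mixed = "sense and responsibility"
same = "sense and sensibility"

backref_on_mixed = backref.search(mixed) is not None
subroutine_on_mixed = subroutine_like.search(mixed) is not None
backref_on_same = backref.search(same) is not None
```

Only the emulated subroutine form accepts the mixed string, because it re-runs the subpattern rather than comparing against the captured text.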
2078  .SH CALLOUTS  .SH CALLOUTS
2079  .rs  .rs
2080  .sp  .sp
# Line 1978  description of the interface to the call Line 2111  description of the interface to the call
2111  documentation.  documentation.
2112  .  .
2113  .  .
2114  .SH "BACTRACKING CONTROL"  .SH "BACKTRACKING CONTROL"
2115  .rs  .rs
2116  .sp  .sp
2117  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
# Line 1987  or removal in a future version of Perl". Line 2120  or removal in a future version of Perl".
2120  production code should be noted to avoid problems during upgrades." The same  production code should be noted to avoid problems during upgrades." The same
2121  remarks apply to the PCRE features described in this section.  remarks apply to the PCRE features described in this section.
2122  .P  .P
2123  Since these verbs are specifically related to backtracking, they can be used  Since these verbs are specifically related to backtracking, most of them can be
2124  only when the pattern is to be matched using \fBpcre_exec()\fP, which uses a  used only when the pattern is to be matched using \fBpcre_exec()\fP, which uses
2125  backtracking algorithm. They cause an error if encountered by  a backtracking algorithm. With the exception of (*FAIL), which behaves like a
2126    failing negative assertion, they cause an error if encountered by
2127  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2128  .P  .P
2129  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
# Line 2111  Cambridge CB2 3QH, England. Line 2245  Cambridge CB2 3QH, England.
2245  .rs  .rs
2246  .sp  .sp
2247  .nf  .nf
2248  Last updated: 09 August 2007  Last updated: 18 March 2009
2249  Copyright (c) 1997-2007 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
2250  .fi  .fi

Legend:
Removed from v.211  
changed lines
  Added in v.394
