
Diff of /code/trunk/doc/pcrepattern.3


revision 227 by ph10, Tue Aug 21 15:00:15 2007 UTC revision 385 by ph10, Sun Mar 8 16:56:58 2009 UTC
# Line 9  are described in detail below. There is Line 9  are described in detail below. There is
9  .\" HREF  .\" HREF
10  \fBpcresyntax\fP  \fBpcresyntax\fP
11  .\"  .\"
12  page. Perl's regular expressions are described in its own documentation, and  page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
13    also supports some alternative regular expression syntax (which does not
14    conflict with the Perl syntax) in order to provide some compatibility with
15    regular expressions in Python, .NET, and Oniguruma.
16    .P
17    Perl's regular expressions are described in its own documentation, and
18  regular expressions in general are covered in a number of books, some of which  regular expressions in general are covered in a number of books, some of which
19  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",  have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
20  published by O'Reilly, covers regular expressions in great detail. This  published by O'Reilly, covers regular expressions in great detail. This
# Line 79  example, on a Unix system where LF is th Line 84  example, on a Unix system where LF is th
84  changes the convention to CR. That pattern matches "a\enb" because LF is no  changes the convention to CR. That pattern matches "a\enb" because LF is no
85  longer a newline. Note that these special settings, which are not  longer a newline. Note that these special settings, which are not
86  Perl-compatible, are recognized only at the very start of a pattern, and that  Perl-compatible, are recognized only at the very start of a pattern, and that
87  they must be in upper case.  they must be in upper case. If more than one of them is present, the last one
88    is used.
89    .P
90    The newline convention does not affect what the \eR escape sequence matches. By
91    default, this is any Unicode newline sequence, for Perl compatibility. However,
92    this can be changed; see the description of \eR in the section entitled
93    .\" HTML <a href="#newlineseq">
94    .\" </a>
95    "Newline sequences"
96    .\"
97    below. A change of \eR setting can be combined with a change of newline
98    convention.
99  .  .
100  .  .
101  .SH "CHARACTERS AND METACHARACTERS"  .SH "CHARACTERS AND METACHARACTERS"
# Line 299  parenthesized subpatterns. Line 315  parenthesized subpatterns.
315  .\"  .\"
316  .  .
317  .  .
318    .SS "Absolute and relative subroutine calls"
319    .rs
320    .sp
321    For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
322    a number enclosed either in angle brackets or single quotes, is an alternative
323    syntax for referencing a subpattern as a "subroutine". Details are discussed
324    .\" HTML <a href="#onigurumasubroutines">
325    .\" </a>
326    later.
327    .\"
328    Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
329    synonymous. The former is a back reference; the latter is a subroutine call.
330    .
331    .
332  .SS "Generic character types"  .SS "Generic character types"
333  .rs  .rs
334  .sp  .sp
# Line 388  accented letters, and these are matched Line 418  accented letters, and these are matched
418  is discouraged.  is discouraged.
419  .  .
420  .  .
421    .\" HTML <a name="newlineseq"></a>
422  .SS "Newline sequences"  .SS "Newline sequences"
423  .rs  .rs
424  .sp  .sp
425  Outside a character class, the escape sequence \eR matches any Unicode newline  Outside a character class, by default, the escape sequence \eR matches any
426  sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is equivalent to  Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is
427  the following:  equivalent to the following:
428  .sp  .sp
429    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
430  .sp  .sp
# Line 413  are added: LS (line separator, U+2028) a Line 444  are added: LS (line separator, U+2028) a
444  Unicode character property support is not needed for these characters to be  Unicode character property support is not needed for these characters to be
445  recognized.  recognized.
446  .P  .P
447    It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
448    complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF
449    either at compile time or when the pattern is matched. (BSR is an abbreviation
450    for "backslash R".) This can be made the default when PCRE is built; if this is
451    the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option.
452    It is also possible to specify these settings by starting a pattern string with
453    one of the following sequences:
454    .sp
455      (*BSR_ANYCRLF)   CR, LF, or CRLF only
456      (*BSR_UNICODE)   any Unicode newline sequence
457    .sp
458    These override the default and the options given to \fBpcre_compile()\fP, but
459    they can be overridden by options given to \fBpcre_exec()\fP. Note that these
460    special settings, which are not Perl-compatible, are recognized only at the
461    very start of a pattern, and that they must be in upper case. If more than one
462    of them is present, the last one is used. They can be combined with a change of
463    newline convention; for example, a pattern can start with:
464    .sp
465      (*ANY)(*BSR_ANYCRLF)
466    .sp
467  Inside a character class, \eR matches the letter "R".  Inside a character class, \eR matches the letter "R".
468  .  .
469  .  .
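
The difference between the two \eR settings described in this hunk can be sketched as follows. This is only an illustration: it uses Python's re module rather than PCRE (an assumption, chosen so the example runs without the PCRE library), it replaces the atomic group of the documented pattern with a plain non-capturing group, and the names BSR_UNICODE and BSR_ANYCRLF are hypothetical labels rather than PCRE API calls.

```python
import re

# Non-UTF-8 equivalent of \R quoted in the text, with "(?>...)" swapped for
# "(?:...)". CRLF is listed first so it is preferred over a lone CR.
BSR_UNICODE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

# What \R is restricted to under PCRE_BSR_ANYCRLF or (*BSR_ANYCRLF).
BSR_ANYCRLF = re.compile(r"(?:\r\n|\n|\r)")

text = "a\r\nb\x0bc\nd"

# Every listed newline sequence acts as a separator.
unicode_parts = BSR_UNICODE.split(text)

# Only CR, LF, and CRLF act as separators; the vertical tab stays in the data.
anycrlf_parts = BSR_ANYCRLF.split(text)
```

With the restricted setting, the vertical tab (\ex0b) is no longer treated as a newline, so "b" and "c" stay in the same piece.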
# Line 960  alternative in the subpattern. Line 1011  alternative in the subpattern.
1011  .rs  .rs
1012  .sp  .sp
1013  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
1014  PCRE_EXTENDED options can be changed from within the pattern by a sequence of  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
1015  Perl option letters enclosed between "(?" and ")". The option letters are  the pattern by a sequence of Perl option letters enclosed between "(?" and ")".
1016    The option letters are
1017  .sp  .sp
1018    i  for PCRE_CASELESS    i  for PCRE_CASELESS
1019    m  for PCRE_MULTILINE    m  for PCRE_MULTILINE
# Line 975  PCRE_MULTILINE while unsetting PCRE_DOTA Line 1027  PCRE_MULTILINE while unsetting PCRE_DOTA
1027  permitted. If a letter appears both before and after the hyphen, the option is  permitted. If a letter appears both before and after the hyphen, the option is
1028  unset.  unset.
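
A minimal sketch of setting and unsetting an option letter inside a pattern is given here. It assumes Python's re module as a stand-in for PCRE. Python shares the "i" letter and the hyphen convention, but it accepts the unsetting form only when scoped to a group, so this is not a full picture of PCRE's behaviour.

```python
import re

# "(?i)" at the very start sets the caseless option for the whole pattern.
caseless = re.compile(r"(?i)abc")

# A letter after the hyphen unsets that option. Python's re accepts this
# only in the group-scoped form "(?-i:...)".
mixed = re.compile(r"(?i)a(?-i:b)c")

caseless_hit = caseless.fullmatch("ABC") is not None
mixed_hit = mixed.fullmatch("AbC") is not None   # "b" is matched case-sensitively
mixed_miss = mixed.fullmatch("ABC") is None      # upper-case "B" is rejected
```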
1029  .P  .P
1030    The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
1031    changed in the same way as the Perl-compatible options by using the characters
1032    J, U and X respectively.
1033    .P
1034  When an option change occurs at top level (that is, not inside subpattern  When an option change occurs at top level (that is, not inside subpattern
1035  parentheses), the change applies to the remainder of the pattern that follows.  parentheses), the change applies to the remainder of the pattern that follows.
1036  If the change is placed right at the start of a pattern, PCRE extracts it into  If the change is placed right at the start of a pattern, PCRE extracts it into
# Line 998  branch is abandoned before the option se Line 1054  branch is abandoned before the option se
1054  option settings happen at compile time. There would be some very weird  option settings happen at compile time. There would be some very weird
1055  behaviour otherwise.  behaviour otherwise.
1056  .P  .P
1057  The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be  \fBNote:\fP There are other PCRE-specific options that can be set by the
1058  changed in the same way as the Perl-compatible options by using the characters  application when the compile or match functions are called. In some cases the
1059  J, U and X respectively.  pattern can contain special leading sequences to override what the application
1060    has set or what has been defaulted. Details are given in the section entitled
1061    .\" HTML <a href="#newlineseq">
1062    .\" </a>
1063    "Newline sequences"
1064    .\"
1065    above.
1066  .  .
1067  .  .
1068  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
# Line 1149  details of the interfaces for handling n Line 1211  details of the interfaces for handling n
1211  \fBpcreapi\fP  \fBpcreapi\fP
1212  .\"  .\"
1213  documentation.  documentation.
1214    .P
1215    \fBWarning:\fP You cannot use different names to distinguish between two
1216    subpatterns with the same number (see the previous section) because PCRE uses
1217    only the numbers when matching.
1218  .  .
1219  .  .
1220  .SH REPETITION  .SH REPETITION
# Line 1197  support is available, \eX{3} matches thr Line 1263  support is available, \eX{3} matches thr
1263  which may be several bytes long (and they may be of different lengths).  which may be several bytes long (and they may be of different lengths).
1264  .P  .P
1265  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
1266  previous item and the quantifier were not present.  previous item and the quantifier were not present. This may be useful for
1267    subpatterns that are referenced as
1268    .\" HTML <a href="#subpatternsassubroutines">
1269    .\" </a>
1270    subroutines
1271    .\"
1272    from elsewhere in the pattern. Items other than subpatterns that have a {0}
1273    quantifier are omitted from the compiled pattern.
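
The effect of the {0} quantifier can be checked with a short sketch. It assumes Python's re module, which treats {0} the same way for ordinary matching. It does not demonstrate the subroutine use mentioned above, because Python's re has no subroutine calls.

```python
import re

# "b{0}" behaves as if neither the "b" nor the quantifier were present.
zero_item = re.compile(r"ab{0}c")

# The same holds for a whole group with a {0} quantifier.
zero_group = re.compile(r"a(?:xyz){0}c")

matches_without_b = zero_item.fullmatch("ac") is not None
rejects_with_b = zero_item.fullmatch("abc") is None
group_skipped = zero_group.fullmatch("ac") is not None
```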
1274  .P  .P
1275  For convenience, the three most common quantifiers have single-character  For convenience, the three most common quantifiers have single-character
1276  abbreviations:  abbreviations:
# Line 1980  It matches "abcabc". It does not match " Line 2053  It matches "abcabc". It does not match "
2053  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
2054  .  .
2055  .  .
2056    .\" HTML <a name="onigurumasubroutines"></a>
2057    .SH "ONIGURUMA SUBROUTINE SYNTAX"
2058    .rs
2059    .sp
2060    For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
2061    a number enclosed either in angle brackets or single quotes, is an alternative
2062    syntax for referencing a subpattern as a subroutine, possibly recursively. Here
2063    are two of the examples used above, rewritten using this syntax:
2064    .sp
2065      (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
2066      (sens|respons)e and \eg'1'ibility
2067    .sp
2068    PCRE supports an extension to Oniguruma: if a number is preceded by a
2069    plus or a minus sign it is taken as a relative reference. For example:
2070    .sp
2071      (abc)(?i:\eg<-1>)
2072    .sp
2073    Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
2074    synonymous. The former is a back reference; the latter is a subroutine call.
2075    .
2076    .
2077  .SH CALLOUTS  .SH CALLOUTS
2078  .rs  .rs
2079  .sp  .sp
# Line 2016  description of the interface to the call Line 2110  description of the interface to the call
2110  documentation.  documentation.
2111  .  .
2112  .  .
2113  .SH "BACTRACKING CONTROL"  .SH "BACKTRACKING CONTROL"
2114  .rs  .rs
2115  .sp  .sp
2116  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
# Line 2025  or removal in a future version of Perl". Line 2119  or removal in a future version of Perl".
2119  production code should be noted to avoid problems during upgrades." The same  production code should be noted to avoid problems during upgrades." The same
2120  remarks apply to the PCRE features described in this section.  remarks apply to the PCRE features described in this section.
2121  .P  .P
2122  Since these verbs are specifically related to backtracking, they can be used  Since these verbs are specifically related to backtracking, most of them can be
2123  only when the pattern is to be matched using \fBpcre_exec()\fP, which uses a  used only when the pattern is to be matched using \fBpcre_exec()\fP, which uses
2124  backtracking algorithm. They cause an error if encountered by  a backtracking algorithm. With the exception of (*FAIL), which behaves like a
2125    failing negative assertion, they cause an error if encountered by
2126  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2127  .P  .P
2128  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
# Line 2149  Cambridge CB2 3QH, England. Line 2244  Cambridge CB2 3QH, England.
2244  .rs  .rs
2245  .sp  .sp
2246  .nf  .nf
2247  Last updated: 21 August 2007  Last updated: 08 March 2009
2248  Copyright (c) 1997-2007 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
2249  .fi  .fi
