Diff of /code/trunk/pcre.3

revision 36 by nigel, Sat Feb 24 21:39:05 2007 UTC revision 37 by nigel, Sat Feb 24 21:39:09 2007 UTC
# Line 66  The PCRE library is a set of functions t Line 66  The PCRE library is a set of functions t
66  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
67  differences (see below). The current implementation corresponds to Perl 5.005.  differences (see below). The current implementation corresponds to Perl 5.005.
68    
69  PCRE has its own native API, which is described in this man page. There is also  PCRE has its own native API, which is described in this document. There is also
70  a set of wrapper functions that correspond to the POSIX API. See  a set of wrapper functions that correspond to the POSIX API. These are
71  \fBpcreposix (3)\fR.  described in the \fBpcreposix\fR documentation.
72    
73    The native API function prototypes are defined in the header file \fBpcre.h\fR,
74    and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be
75    accessed by adding \fB-lpcre\fR to the command for linking an application which
76    calls it.
77    
78  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
79  are used for compiling and matching regular expressions, while  are used for compiling and matching regular expressions, while
# Line 237  and is sufficient for many applications. Line 242  and is sufficient for many applications.
242    
243  An alternative set of tables can, however, be supplied. Such tables are built  An alternative set of tables can, however, be supplied. Such tables are built
244  by calling the \fBpcre_maketables()\fR function, which has no arguments, in the  by calling the \fBpcre_maketables()\fR function, which has no arguments, in the
245  relevant locale. The result can then be passed to \fBpcre_compile()\ as often  relevant locale. The result can then be passed to \fBpcre_compile()\fR as often
246  as necessary. For example, to build and use tables that are appropriate for the  as necessary. For example, to build and use tables that are appropriate for the
247  French locale (where accented characters with codes greater than 128 are  French locale (where accented characters with codes greater than 128 are
248  treated as letters), the following code could be used:  treated as letters), the following code could be used:
# Line 276  string. If there is a fixed first charac Line 281  string. If there is a fixed first charac
281  (cat|cow|coyote), then it is returned in the integer pointed to by  (cat|cow|coyote), then it is returned in the integer pointed to by
282  \fIfirstcharptr\fR. Otherwise, if either  \fIfirstcharptr\fR. Otherwise, if either
283    
284    (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
285        starts with "^", or  starts with "^", or
286    
287    (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set  (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
288        (if it were set, the pattern would be anchored),  (if it were set, the pattern would be anchored),
289    
290  then -1 is returned, indicating that the pattern matches only at the  then -1 is returned, indicating that the pattern matches only at the
291  start of a subject string or after any "\\n" within the string. Otherwise -2 is  start of a subject string or after any "\\n" within the string. Otherwise -2 is
# Line 298  unused bits must be zero. However, if a Line 303  unused bits must be zero. However, if a
303  PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it  PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
304  cannot be made unanchored at matching time.  cannot be made unanchored at matching time.
305    
306  There are also two further options that can be set only at matching time:  There are also three further options that can be set only at matching time:
307    
308    PCRE_NOTBOL    PCRE_NOTBOL
309    
# Line 313  should not match it nor (except in multi Line 318  should not match it nor (except in multi
318  it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never  it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never
319  to match.  to match.
320    
321      PCRE_NOTEMPTY
322    
323    An empty string is not considered to be a valid match if this option is set. If
324    there are alternatives in the pattern, they are tried. If all the alternatives
325    match the empty string, the entire match fails. For example, if the pattern
326    
327      a?b?
328    
329    is applied to a string not beginning with "a" or "b", it matches the empty
330    string at the start of the subject. With PCRE_NOTEMPTY set, this match is not
331    valid, so PCRE searches further into the string for occurrences of "a" or "b".
332    Perl has no direct equivalent of this option, but it makes a special case of
333    a pattern match of the empty string within its \fBsplit()\fR function. Using
334    PCRE_NOTEMPTY it is possible to emulate this behaviour.
335    
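The effect described for PCRE_NOTEMPTY can be approximated in a short Python sketch. Python's re module is a different engine from PCRE and has no such option, so the helper first_nonempty_match below is a hypothetical stand-in that simply discards empty matches. It illustrates the described behaviour under that assumption and is not PCRE's implementation.

```python
import re

def first_nonempty_match(pattern, subject):
    """Return the first non-empty match of pattern in subject, or None.

    Hypothetical helper: approximates the documented effect of
    PCRE_NOTEMPTY by skipping empty matches. It does not call PCRE.
    """
    for m in re.finditer(pattern, subject):
        if m.group(0) != "":
            return m
    return None

subject = "xyzab"

# Without the option, a?b? matches the empty string at the start of the subject.
plain = re.search(r"a?b?", subject)
print(repr(plain.group(0)), plain.span())

# With empty matches rejected, the search moves on to the later "ab".
nonempty = first_nonempty_match(r"a?b?", subject)
print(repr(nonempty.group(0)), nonempty.span())
```

A subject with no "a" or "b" at all yields no match from the helper, which corresponds to the statement that the entire match fails when only empty matches are available.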
336  The subject string is passed as a pointer in \fIsubject\fR, a length in  The subject string is passed as a pointer in \fIsubject\fR, a length in
337  \fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern  \fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern
338  string, it may contain binary zero characters. When the starting offset is  string, it may contain binary zero characters. When the starting offset is
# Line 572  meaning is faulted. Line 592  meaning is faulted.
592  inverted, that is, by default they are not greedy, but if followed by a  inverted, that is, by default they are not greedy, but if followed by a
593  question mark they are.  question mark they are.
594    
595    (e) PCRE_ANCHORED can be used to force a pattern to be tried only at the start
596    of the subject.
597    
598    (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
599    \fBpcre_exec()\fR have no Perl equivalents.
600    
601    
602  .SH REGULAR EXPRESSION DETAILS  .SH REGULAR EXPRESSION DETAILS
603  The syntax and semantics of the regular expressions supported by PCRE are  The syntax and semantics of the regular expressions supported by PCRE are
# Line 1240  Several assertions (of any sort) may occ Line 1266  Several assertions (of any sort) may occ
1266    
1267    (?<=\\d{3})(?<!999)foo    (?<=\\d{3})(?<!999)foo
1268    
1269  matches "foo" preceded by three digits that are not "999". Furthermore,  matches "foo" preceded by three digits that are not "999". Notice that each of
1270  assertions can be nested in any combination. For example,  the assertions is applied independently at the same point in the subject
1271    string. First there is a check that the previous three characters are all
1272    digits, then there is a check that the same three characters are not "999".
1273    This pattern does \fInot\fR match "foo" preceded by six characters, the first
1274    three of which are digits and the last three of which are not "999". For example, it
1275    doesn't match "123abcfoo". A pattern to do that is
1276    
1277      (?<=\\d{3}...)(?<!999)foo
1278    
1279    This time the first assertion looks at the preceding six characters, checking
1280    that the first three are digits, and then the second assertion checks that the
1281    preceding three characters are not "999".
1282    
1283    Assertions can be nested in any combination. For example,
1284    
1285    (?<=(?<!foo)bar)baz    (?<=(?<!foo)bar)baz
1286    
1287  matches an occurrence of "baz" that is preceded by "bar" which in turn is not  matches an occurrence of "baz" that is preceded by "bar" which in turn is not
1288  preceded by "foo".  preceded by "foo", while
1289    
1290      (?<=\\d{3}(?!999)...)foo
1291    
1292    is another pattern which matches "foo" preceded by three digits and any three
1293    characters that are not "999".
1294    
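The assertion patterns above can be tried out with a small Python sketch. Python's re module is not PCRE, but it accepts these particular fixed-width lookbehind patterns with the same Perl-style syntax; the helper name matches and the test subjects are illustrative assumptions.

```python
import re

# Two assertions applied at the same point: the three characters before
# "foo" must all be digits and must not be "999".
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters (three digits, then any
# three); the second checks that the last three of those are not "999".
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

# Nested assertions: "baz" preceded by "bar" not itself preceded by "foo".
nested = re.compile(r"(?<=(?<!foo)bar)baz")

# A negative lookahead nested inside the lookbehind.
nested_ahead = re.compile(r"(?<=\d{3}(?!999)...)foo")

def matches(regex, subject):
    """Return True if regex matches anywhere in subject."""
    return regex.search(subject) is not None

for name, regex, subject in [
    ("same_point", same_point, "123foo"),
    ("same_point", same_point, "999foo"),
    ("same_point", same_point, "123abcfoo"),
    ("six_back", six_back, "123abcfoo"),
    ("six_back", six_back, "123999foo"),
    ("nested", nested, "barbaz"),
    ("nested", nested, "foobarbaz"),
    ("nested_ahead", nested_ahead, "123abcfoo"),
    ("nested_ahead", nested_ahead, "123999foo"),
]:
    print(name, repr(subject), matches(regex, subject))
```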
1295  Assertion subpatterns are not capturing subpatterns, and may not be repeated,  Assertion subpatterns are not capturing subpatterns, and may not be repeated,
1296  because it makes no sense to assert the same thing several times. If any kind  because it makes no sense to assert the same thing several times. If any kind
# Line 1398  because the . metacharacter does not the Line 1442  because the . metacharacter does not the
1442  string contains newlines, the pattern may match from the character immediately  string contains newlines, the pattern may match from the character immediately
1443  following one of them instead of from the very start. For example, the pattern  following one of them instead of from the very start. For example, the pattern
1444    
1445     (.*) second    (.*) second
1446    
1447  matches the subject "first\\nand second" (where \\n stands for a newline  matches the subject "first\\nand second" (where \\n stands for a newline
1448  character) with the first captured substring being "and". In order to do this,  character) with the first captured substring being "and". In order to do this,
# Line 1409  newlines, the best performance is obtain Line 1453  newlines, the best performance is obtain
1453  the pattern with ^.* to indicate explicit anchoring. That saves PCRE from  the pattern with ^.* to indicate explicit anchoring. That saves PCRE from
1454  having to scan along the subject looking for a newline to restart at.  having to scan along the subject looking for a newline to restart at.
1455    
1456    Beware of patterns that contain nested indefinite repeats. These can take a
1457    long time to run when applied to a string that does not match. Consider the
1458    pattern fragment
1459    
1460      (a+)*
1461    
1462    This can match "aaaa" in 16 different ways, and this number increases very
1463    rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4
1464    times, and for each of those cases other than 0, the + repeats can match
1465    different numbers of times.) When the remainder of the pattern is such that the
1466    entire match is going to fail, PCRE has in principle to try every possible
1467    variation, and this can take an extremely long time.
1468    
1469    An optimization catches some of the simpler cases such as
1470    
1471      (a+)*b
1472    
1473    where a literal character follows. Before embarking on the standard matching
1474    procedure, PCRE checks that there is a "b" later in the subject string, and if
1475    there is not, it fails the match immediately. However, when there is no
1476    following literal this optimization cannot be used. You can see the difference
1477    by comparing the behaviour of
1478    
1479      (a+)*\\d
1480    
1481    with the pattern above. The pattern with the literal "b" fails almost instantly
1482    when applied to a whole line of "a" characters, whereas the one ending in \\d
1483    takes an appreciable time with strings longer than about 20 characters.
1484    
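To get a feel for how quickly the number of candidate matches grows, the Python sketch below counts the ways (a+)* can match a run of "a" characters. It assumes one particular counting convention: every prefix of the run, including the empty one, counts as a possible match, and each distinct division of a prefix among iterations of the group counts separately. The function name count_ways is hypothetical, and the code does not call PCRE or measure its running time.

```python
def count_ways(n):
    """Count the ways (a+)* can match some prefix of a run of n 'a' characters."""
    # splits[k]: ways to divide exactly k characters into an ordered
    # sequence of non-empty runs, one run per iteration of the group.
    splits = [0] * (n + 1)
    splits[0] = 1  # zero iterations, matching the empty string
    for k in range(1, n + 1):
        # The last iteration takes k - j characters; the first j are split freely.
        splits[k] = sum(splits[j] for j in range(k))
    return sum(splits)

for n in (4, 10, 20):
    print(n, count_ways(n))
```

Under this convention the total doubles with every additional character, which is why a failing match against even a moderately long subject can involve an enormous number of attempts.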
1485  .SH AUTHOR  .SH AUTHOR
1486  Philip Hazel <ph10@cam.ac.uk>  Philip Hazel <ph10@cam.ac.uk>
1487  .br  .br
# Line 1420  Cambridge CB2 3QG, England. Line 1493  Cambridge CB2 3QG, England.
1493  .br  .br
1494  Phone: +44 1223 334714  Phone: +44 1223 334714
1495    
1496  Last updated: 10 June 1999  Last updated: 29 July 1999
1497  .br  .br
1498  Copyright (c) 1997-1999 University of Cambridge.  Copyright (c) 1997-1999 University of Cambridge.
