Diff of /code/trunk/doc/pcrepattern.3

revision 518 by ph10, Tue May 18 15:47:01 2010 UTC revision 551 by ph10, Sun Oct 10 17:33:07 2010 UTC
# Line 42  in the main Line 42  in the main
42  .\"  .\"
43  page.  page.
44  .P  .P
45  Another special sequence that may appear at the start of a pattern or in  Another special sequence that may appear at the start of a pattern or in
46  combination with (*UTF8) is:  combination with (*UTF8) is:
47  .sp  .sp
48    (*UCP)    (*UCP)
49  .sp  .sp
50  This has the same effect as setting the PCRE_UCP option: it causes sequences  This has the same effect as setting the PCRE_UCP option: it causes sequences
51  such as \ed and \ew to use Unicode properties to determine character types,  such as \ed and \ew to use Unicode properties to determine character types,
52  instead of recognizing only characters with codes less than 128 via a lookup  instead of recognizing only characters with codes less than 128 via a lookup
53  table.  table.
54  .P  .P
55  The remainder of this document discusses the patterns that are supported by  The remainder of this document discusses the patterns that are supported by
# Line 367  Another use of backslash is for specifyi Line 367  Another use of backslash is for specifyi
367    \ew     any "word" character    \ew     any "word" character
368    \eW     any "non-word" character    \eW     any "non-word" character
369  .sp  .sp
370  There is also the single sequence \eN, which matches a non-newline character.  There is also the single sequence \eN, which matches a non-newline character.
371  This is the same as  This is the same as
372  .\" HTML <a href="#fullstopdot">  .\" HTML <a href="#fullstopdot">
373  .\" </a>  .\" </a>
374  the "." metacharacter  the "." metacharacter
375  .\"  .\"
376  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set.
377  .P  .P
# Line 408  Unicode is discouraged. Line 408  Unicode is discouraged.
408  By default, in UTF-8 mode, characters with values greater than 128 never match  By default, in UTF-8 mode, characters with values greater than 128 never match
409  \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain  \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain
410  their original meanings from before UTF-8 support was available, mainly for  their original meanings from before UTF-8 support was available, mainly for
411  efficiency reasons. However, if PCRE is compiled with Unicode property support,  efficiency reasons. However, if PCRE is compiled with Unicode property support,
412  and the PCRE_UCP option is set, the behaviour is changed so that Unicode  and the PCRE_UCP option is set, the behaviour is changed so that Unicode
413  properties are used to determine character types, as follows:  properties are used to determine character types, as follows:
414  .sp  .sp
# Line 417  properties are used to determine charact Line 417  properties are used to determine charact
417    \ew  any character that \ep{L} or \ep{N} matches, plus underscore    \ew  any character that \ep{L} or \ep{N} matches, plus underscore
418  .sp  .sp
419  The upper case escapes match the inverse sets of characters. Note that \ed  The upper case escapes match the inverse sets of characters. Note that \ed
420  matches only decimal digits, whereas \ew matches any Unicode digit, as well as  matches only decimal digits, whereas \ew matches any Unicode digit, as well as
421  any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb, and  any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb, and
422  \eB because they are defined in terms of \ew and \eW. Matching these sequences  \eB because they are defined in terms of \ew and \eW. Matching these sequences
423  is noticeably slower when PCRE_UCP is set.  is noticeably slower when PCRE_UCP is set.
424  .P  .P
425  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
426  other sequences, which match only ASCII characters by default, these always  other sequences, which match only ASCII characters by default, these always
427  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is
428  set. The horizontal space characters are:  set. The horizontal space characters are:
429  .sp  .sp
430    U+0009     Horizontal tab    U+0009     Horizontal tab
# Line 527  The extra escape sequences are: Line 527  The extra escape sequences are:
527  The property names represented by \fIxx\fP above are limited to the Unicode  The property names represented by \fIxx\fP above are limited to the Unicode
528  script names, the general category properties, "Any", which matches any  script names, the general category properties, "Any", which matches any
529  character (including newline), and some special PCRE properties (described  character (including newline), and some special PCRE properties (described
530  in the  in the
531  .\" HTML <a href="#extraprops">  .\" HTML <a href="#extraprops">
532  .\" </a>  .\" </a>
533  next section).  next section).
534  .\"  .\"
535  Other Perl properties such as "InMusicalSymbols" are not currently supported by  Other Perl properties such as "InMusicalSymbols" are not currently supported by
536  PCRE. Note that \eP{Any} does not match any characters, so always causes a  PCRE. Note that \eP{Any} does not match any characters, so always causes a
# Line 741  non-UTF-8 mode \eX matches any one chara Line 741  non-UTF-8 mode \eX matches any one chara
741  Matching characters by Unicode property is not fast, because PCRE has to search  Matching characters by Unicode property is not fast, because PCRE has to search
742  a structure that contains data for over fifteen thousand characters. That is  a structure that contains data for over fifteen thousand characters. That is
743  why the traditional escape sequences such as \ed and \ew do not use Unicode  why the traditional escape sequences such as \ed and \ew do not use Unicode
744  properties in PCRE by default, though you can make them do so by setting the  properties in PCRE by default, though you can make them do so by setting the
745  PCRE_UCP option for \fBpcre_compile()\fP or by starting the pattern with  PCRE_UCP option for \fBpcre_compile()\fP or by starting the pattern with
746  (*UCP).  (*UCP).
747  .  .
# Line 750  PCRE_UCP option for \fBpcre_compile()\fP Line 750  PCRE_UCP option for \fBpcre_compile()\fP
750  .SS PCRE's additional properties  .SS PCRE's additional properties
751  .rs  .rs
752  .sp  .sp
753  As well as the standard Unicode properties described in the previous  As well as the standard Unicode properties described in the previous
754  section, PCRE supports four more that make it possible to convert traditional  section, PCRE supports four more that make it possible to convert traditional
755  escape sequences such as \ew and \es and POSIX character classes to use Unicode  escape sequences such as \ew and \es and POSIX character classes to use Unicode
756  properties. PCRE uses these non-standard, non-Perl properties internally when  properties. PCRE uses these non-standard, non-Perl properties internally when
757  PCRE_UCP is set. They are:  PCRE_UCP is set. They are:
# Line 761  PCRE_UCP is set. They are: Line 761  PCRE_UCP is set. They are:
761    Xsp   Any Perl space character    Xsp   Any Perl space character
762    Xwd   Any Perl "word" character    Xwd   Any Perl "word" character
763  .sp  .sp
764  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
765  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
766  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
767  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
768  same characters as Xan, plus underscore.  same characters as Xan, plus underscore.
769  .  .
770  .  .
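
The four property definitions in the hunk above (Xan, Xps, Xsp, Xwd) can be sketched as plain predicates. This is only an illustration in Python using the standard `unicodedata` module, not PCRE's own implementation: the function names are hypothetical, and Python's Unicode tables may be a different version from the ones compiled into PCRE, so individual characters could be classified differently.

```python
import unicodedata

# The five control characters that the text lists explicitly for Xps.
_XPS_CONTROLS = {"\t", "\n", "\x0b", "\x0c", "\r"}


def is_xan(ch: str) -> bool:
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch: str) -> bool:
    """Xps: tab, linefeed, vertical tab, formfeed, carriage return,
    or any character with the Z (separator) property."""
    return ch in _XPS_CONTROLS or unicodedata.category(ch).startswith("Z")


def is_xsp(ch: str) -> bool:
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and is_xps(ch)


def is_xwd(ch: str) -> bool:
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)
```

Under these definitions, vertical tab satisfies `is_xps` but not `is_xsp`, and underscore satisfies `is_xwd` but not `is_xan`, which mirrors the distinctions drawn in the text.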
# Line 825  The backslashed assertions are: Line 825  The backslashed assertions are:
825    \eG     matches at the first matching position in the subject    \eG     matches at the first matching position in the subject
826  .sp  .sp
827  Inside a character class, \eb has a different meaning; it matches the backspace  Inside a character class, \eb has a different meaning; it matches the backspace
828  character. If any other of these assertions appears in a character class, by  character. If any other of these assertions appears in a character class, by
829  default it matches the corresponding literal character (for example, \eB  default it matches the corresponding literal character (for example, \eB
830  matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid  matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
831  escape sequence" error is generated instead.  escape sequence" error is generated instead.
# Line 946  The handling of dot is entirely independ Line 946  The handling of dot is entirely independ
946  dollar, the only relationship being that they both involve newlines. Dot has no  dollar, the only relationship being that they both involve newlines. Dot has no
947  special meaning in a character class.  special meaning in a character class.
948  .P  .P
949  The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not  The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not
950  set. In other words, it matches any one character except one that signifies the  set. In other words, it matches any one character except one that signifies the
951  end of a line.  end of a line.
952  .  .
953  .  .
# Line 1102  supported, and an error is given if they Line 1102  supported, and an error is given if they
1102  .P  .P
1103  By default, in UTF-8 mode, characters with values greater than 128 do not match  By default, in UTF-8 mode, characters with values greater than 128 do not match
1104  any of the POSIX character classes. However, if the PCRE_UCP option is passed  any of the POSIX character classes. However, if the PCRE_UCP option is passed
1105  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
1106  character properties are used. This is achieved by replacing the POSIX classes  character properties are used. This is achieved by replacing the POSIX classes
1107  by other sequences, as follows:  by other sequences, as follows:
1108  .sp  .sp
1109    [:alnum:]  becomes  \ep{Xan}    [:alnum:]  becomes  \ep{Xan}
1110    [:alpha:]  becomes  \ep{L}    [:alpha:]  becomes  \ep{L}
1111    [:blank:]  becomes  \eh    [:blank:]  becomes  \eh
1112    [:digit:]  becomes  \ep{Nd}    [:digit:]  becomes  \ep{Nd}
1113    [:lower:]  becomes  \ep{Ll}    [:lower:]  becomes  \ep{Ll}
1114    [:space:]  becomes  \ep{Xps}    [:space:]  becomes  \ep{Xps}
1115    [:upper:]  becomes  \ep{Lu}    [:upper:]  becomes  \ep{Lu}
1116    [:word:]   becomes  \ep{Xwd}    [:word:]   becomes  \ep{Xwd}
1117  .sp  .sp
# Line 2630  matching name is found, normal "bumpalon Line 2630  matching name is found, normal "bumpalon
2630  .sp  .sp
2631    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2632  .sp  .sp
2633  This verb causes a skip to the next alternation if the rest of the pattern does  This verb causes a skip to the next alternation in the innermost enclosing
2634  not match. That is, it cancels pending backtracking, but only within the  group if the rest of the pattern does not match. That is, it cancels pending
2635  current alternation. Its name comes from the observation that it can be used  backtracking, but only within the current alternation. Its name comes from the
2636  for a pattern-based if-then-else block:  observation that it can be used for a pattern-based if-then-else block:
2637  .sp  .sp
2638    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2639  .sp  .sp
# Line 2644  behaviour of (*THEN:NAME) is exactly the Line 2644  behaviour of (*THEN:NAME) is exactly the
2644  overall match fails. If (*THEN) is not directly inside an alternation, it acts  overall match fails. If (*THEN) is not directly inside an alternation, it acts
2645  like (*PRUNE).  like (*PRUNE).
2646  .  .
2647    .P
2648    The above verbs provide four different "strengths" of control when subsequent
2649    matching fails. (*THEN) is the weakest, carrying on the match at the next
2650    alternation. (*PRUNE) comes next, failing the match at the current starting
2651    position, but allowing an advance to the next character (for an unanchored
2652    pattern). (*SKIP) is similar, except that the advance may be more than one
2653    character. (*COMMIT) is the strongest, causing the entire match to fail.
2654    .P
2655    If more than one is present in a pattern, the "strongest" one wins. For example,
2656    consider this pattern, where A, B, etc. are complex pattern fragments:
2657    .sp
2658      (A(*COMMIT)B(*THEN)C|D)
2659    .sp
2660    Once A has matched, PCRE is committed to this match, at the current starting
2661    position. If subsequently B matches, but C does not, the normal (*THEN) action
2662    of trying the next alternation (that is, D) does not happen because (*COMMIT)
2663    overrides.
2664    .
2665  .  .
2666  .SH "SEE ALSO"  .SH "SEE ALSO"
2667  .rs  .rs
# Line 2666  Cambridge CB2 3QH, England. Line 2684  Cambridge CB2 3QH, England.
2684  .rs  .rs
2685  .sp  .sp
2686  .nf  .nf
2687  Last updated: 18 May 2010  Last updated: 10 October 2010
2688  Copyright (c) 1997-2010 University of Cambridge.  Copyright (c) 1997-2010 University of Cambridge.
2689  .fi  .fi
