Diff of /code/trunk/doc/pcrepattern.3

revision 518 by ph10, Tue May 18 15:47:01 2010 UTC revision 555 by ph10, Tue Oct 26 08:26:20 2010 UTC
# Line 42  in the main Line 42  in the main
42  .\"  .\"
43  page.  page.
44  .P  .P
45  Another special sequence that may appear at the start of a pattern or in  Another special sequence that may appear at the start of a pattern or in
46  combination with (*UTF8) is:  combination with (*UTF8) is:
47  .sp  .sp
48    (*UCP)    (*UCP)
49  .sp  .sp
50  This has the same effect as setting the PCRE_UCP option: it causes sequences  This has the same effect as setting the PCRE_UCP option: it causes sequences
51  such as \ed and \ew to use Unicode properties to determine character types,  such as \ed and \ew to use Unicode properties to determine character types,
52  instead of recognizing only characters with codes less than 128 via a lookup  instead of recognizing only characters with codes less than 128 via a lookup
53  table.  table.
54  .P  .P
55  The remainder of this document discusses the patterns that are supported by  The remainder of this document discusses the patterns that are supported by
# Line 210  Perl, $ and @ cause variable interpolati Line 210  Perl, $ and @ cause variable interpolati
210    \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz    \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
211  .sp  .sp
212  The \eQ...\eE sequence is recognized both inside and outside character classes.  The \eQ...\eE sequence is recognized both inside and outside character classes.
213    An isolated \eE that is not preceded by \eQ is ignored.
214  .  .
215  .  .
216  .\" HTML <a name="digitsafterbackslash"></a>  .\" HTML <a name="digitsafterbackslash"></a>
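The \Q...\E rules in the hunk above can be sketched outside PCRE. The following is a minimal illustration in Python, whose standard `re` module does not support \Q...\E itself. The helper name `expand_quoted` is hypothetical, and the sketch makes no attempt to treat character classes specially: it only rewrites quoted spans as escaped literals and drops an isolated \E.

```python
import re

def expand_quoted(pattern: str) -> str:
    """Rewrite PCRE-style \\Q...\\E spans as escaped literals (hypothetical helper)."""
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if quoting:
            if two == r"\E":
                # End of the quoted span.
                quoting = False
                i += 2
            else:
                # Inside \Q...\E every character is taken literally.
                out.append(re.escape(pattern[i]))
                i += 1
        elif two == r"\Q":
            quoting = True
            i += 2
        elif two == r"\E":
            # An isolated \E that is not preceded by \Q is ignored.
            i += 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            # Any other escape sequence is copied through unchanged.
            out.append(two)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

# The table row above: \Qabc\E\$\Qxyz\E should match the text abc$xyz.
converted = expand_quoted(r"\Qabc\E\$\Qxyz\E")
print(converted, bool(re.fullmatch(converted, "abc$xyz")))
```

The result is an ordinary pattern that Python's `re` can compile, which makes it easy to check the "abc$xyz" row by hand.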
# Line 367  Another use of backslash is for specifyi Line 368  Another use of backslash is for specifyi
368    \ew     any "word" character    \ew     any "word" character
369    \eW     any "non-word" character    \eW     any "non-word" character
370  .sp  .sp
371  There is also the single sequence \eN, which matches a non-newline character.  There is also the single sequence \eN, which matches a non-newline character.
372  This is the same as  This is the same as
373  .\" HTML <a href="#fullstopdot">  .\" HTML <a href="#fullstopdot">
374  .\" </a>  .\" </a>
375  the "." metacharacter  the "." metacharacter
376  .\"  .\"
377  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set.
378  .P  .P
# Line 408  Unicode is discouraged. Line 409  Unicode is discouraged.
409  By default, in UTF-8 mode, characters with values greater than 127 never match  By default, in UTF-8 mode, characters with values greater than 127 never match
410  \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain  \ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain
411  their original meanings from before UTF-8 support was available, mainly for  their original meanings from before UTF-8 support was available, mainly for
412  efficiency reasons. However, if PCRE is compiled with Unicode property support,  efficiency reasons. However, if PCRE is compiled with Unicode property support,
413  and the PCRE_UCP option is set, the behaviour is changed so that Unicode  and the PCRE_UCP option is set, the behaviour is changed so that Unicode
414  properties are used to determine character types, as follows:  properties are used to determine character types, as follows:
415  .sp  .sp
# Line 417  properties are used to determine charact Line 418  properties are used to determine charact
418    \ew  any character that \ep{L} or \ep{N} matches, plus underscore    \ew  any character that \ep{L} or \ep{N} matches, plus underscore
419  .sp  .sp
420  The upper case escapes match the inverse sets of characters. Note that \ed  The upper case escapes match the inverse sets of characters. Note that \ed
421  matches only decimal digits, whereas \ew matches any Unicode digit, as well as  matches only decimal digits, whereas \ew matches any Unicode digit, as well as
422  any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb, and  any Unicode letter, and underscore. Note also that PCRE_UCP affects \eb, and
423  \eB because they are defined in terms of \ew and \eW. Matching these sequences  \eB because they are defined in terms of \ew and \eW. Matching these sequences
424  is noticeably slower when PCRE_UCP is set.  is noticeably slower when PCRE_UCP is set.
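The contrast between ASCII-only and property-based character types can be seen in a rough analogue. The sketch below uses Python's standard `re` module, not PCRE: its `re.ASCII` flag behaves like the default described above, and its default Unicode mode is assumed here to be close to what PCRE_UCP does for \d and \w. The function name `classify` is hypothetical, and exact character coverage may differ between the two libraries.

```python
import re

def classify(ch):
    """Report whether the digit and word escapes match ch in each mode."""
    return {
        # re.ASCII restricts the escapes to ASCII, like PCRE without PCRE_UCP.
        "ascii_d": bool(re.fullmatch(r"\d", ch, re.ASCII)),
        "ascii_w": bool(re.fullmatch(r"\w", ch, re.ASCII)),
        # Python's default for str patterns uses Unicode character data,
        # used here only as a stand-in for PCRE_UCP.
        "unicode_d": bool(re.fullmatch(r"\d", ch)),
        "unicode_w": bool(re.fullmatch(r"\w", ch)),
    }

# ASCII digit, ARABIC-INDIC DIGIT THREE, e-acute, SUPERSCRIPT TWO
for ch in ["7", "\u0663", "\u00e9", "\u00b2"]:
    print(repr(ch), classify(ch))
```

SUPERSCRIPT TWO is the interesting case: it is a Unicode digit but not a decimal digit, so the word escape accepts it while the digit escape does not.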
425  .P  .P
426  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
427  other sequences, which match only ASCII characters by default, these always  other sequences, which match only ASCII characters by default, these always
428  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is
429  set. The horizontal space characters are:  set. The horizontal space characters are:
430  .sp  .sp
431    U+0009     Horizontal tab    U+0009     Horizontal tab
# Line 527  The extra escape sequences are: Line 528  The extra escape sequences are:
528  The property names represented by \fIxx\fP above are limited to the Unicode  The property names represented by \fIxx\fP above are limited to the Unicode
529  script names, the general category properties, "Any", which matches any  script names, the general category properties, "Any", which matches any
530  character (including newline), and some special PCRE properties (described  character (including newline), and some special PCRE properties (described
531  in the  in the
532  .\" HTML <a href="#extraprops">  .\" HTML <a href="#extraprops">
533  .\" </a>  .\" </a>
534  next section).  next section).
535  .\"  .\"
536  Other Perl properties such as "InMusicalSymbols" are not currently supported by  Other Perl properties such as "InMusicalSymbols" are not currently supported by
537  PCRE. Note that \eP{Any} does not match any characters, so always causes a  PCRE. Note that \eP{Any} does not match any characters, so always causes a
# Line 741  non-UTF-8 mode \eX matches any one chara Line 742  non-UTF-8 mode \eX matches any one chara
742  Matching characters by Unicode property is not fast, because PCRE has to search  Matching characters by Unicode property is not fast, because PCRE has to search
743  a structure that contains data for over fifteen thousand characters. That is  a structure that contains data for over fifteen thousand characters. That is
744  why the traditional escape sequences such as \ed and \ew do not use Unicode  why the traditional escape sequences such as \ed and \ew do not use Unicode
745  properties in PCRE by default, though you can make them do so by setting the  properties in PCRE by default, though you can make them do so by setting the
746  PCRE_UCP option for \fBpcre_compile()\fP or by starting the pattern with  PCRE_UCP option for \fBpcre_compile()\fP or by starting the pattern with
747  (*UCP).  (*UCP).
748  .  .
# Line 750  PCRE_UCP option for \fBpcre_compile()\fP Line 751  PCRE_UCP option for \fBpcre_compile()\fP
751  .SS PCRE's additional properties  .SS PCRE's additional properties
752  .rs  .rs
753  .sp  .sp
754  As well as the standard Unicode properties described in the previous  As well as the standard Unicode properties described in the previous
755  section, PCRE supports four more that make it possible to convert traditional  section, PCRE supports four more that make it possible to convert traditional
756  escape sequences such as \ew and \es and POSIX character classes to use Unicode  escape sequences such as \ew and \es and POSIX character classes to use Unicode
757  properties. PCRE uses these non-standard, non-Perl properties internally when  properties. PCRE uses these non-standard, non-Perl properties internally when
758  PCRE_UCP is set. They are:  PCRE_UCP is set. They are:
# Line 761  PCRE_UCP is set. They are: Line 762  PCRE_UCP is set. They are:
762    Xsp   Any Perl space character    Xsp   Any Perl space character
763    Xwd   Any Perl "word" character    Xwd   Any Perl "word" character
764  .sp  .sp
765  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
766  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
767  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
768  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
769  same characters as Xan, plus underscore.  same characters as Xan, plus underscore.
770  .  .
771  .  .
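The four definitions above translate directly into predicates over Unicode general categories. This is a minimal model in Python using the standard `unicodedata` module, not PCRE's own implementation. The function names are hypothetical, and each function assumes it is given a single character.

```python
import unicodedata

# Tab, linefeed, vertical tab, formfeed, carriage return.
_XPS_CONTROLS = {"\t", "\n", "\v", "\f", "\r"}

def is_xan(ch):
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xps(ch):
    """Xps: one of the five listed controls, or any Z (separator) character."""
    return ch in _XPS_CONTROLS or unicodedata.category(ch)[0] == "Z"

def is_xsp(ch):
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\v" and is_xps(ch)

def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)

print(is_xan("\u00e9"), is_xwd("_"), is_xps("\v"), is_xsp("\v"))
```

The explicit control set is needed because tab and the other four characters have general category Cc rather than Z.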
# Line 825  The backslashed assertions are: Line 826  The backslashed assertions are:
826    \eG     matches at the first matching position in the subject    \eG     matches at the first matching position in the subject
827  .sp  .sp
828  Inside a character class, \eb has a different meaning; it matches the backspace  Inside a character class, \eb has a different meaning; it matches the backspace
829  character. If any other of these assertions appears in a character class, by  character. If any other of these assertions appears in a character class, by
830  default it matches the corresponding literal character (for example, \eB  default it matches the corresponding literal character (for example, \eB
831  matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid  matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
832  escape sequence" error is generated instead.  escape sequence" error is generated instead.
# Line 946  The handling of dot is entirely independ Line 947  The handling of dot is entirely independ
947  dollar, the only relationship being that they both involve newlines. Dot has no  dollar, the only relationship being that they both involve newlines. Dot has no
948  special meaning in a character class.  special meaning in a character class.
949  .P  .P
950  The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not  The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not
951  set. In other words, it matches any one character except one that signifies the  set. In other words, it matches any one character except one that signifies the
952  end of a line.  end of a line.
953  .  .
954  .  .
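The relationship between \N and dot can be illustrated by analogy. Python's standard `re` module has no \N escape with this meaning, so the sketch below substitutes the class `[^\n]` and assumes a plain linefeed as the only newline. PCRE's configurable newline conventions are not modelled.

```python
import re

subject = "ab\ncd"

# Stand-in for PCRE's \N: any one character except the newline.
NON_NEWLINE = r"[^\n]"

# Dot without DOTALL stops at the newline.
dot_default = re.findall(r".", subject)
# Dot with DOTALL also matches the newline.
dot_dotall = re.findall(r".", subject, re.DOTALL)
# The stand-in ignores DOTALL, as \N is described as doing.
non_newline_dotall = re.findall(NON_NEWLINE, subject, re.DOTALL)

print(dot_default)
print(dot_dotall)
print(non_newline_dotall)
```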
# Line 1102  supported, and an error is given if they Line 1103  supported, and an error is given if they
1103  .P  .P
1104  By default, in UTF-8 mode, characters with values greater than 127 do not match  By default, in UTF-8 mode, characters with values greater than 127 do not match
1105  any of the POSIX character classes. However, if the PCRE_UCP option is passed  any of the POSIX character classes. However, if the PCRE_UCP option is passed
1106  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode  to \fBpcre_compile()\fP, some of the classes are changed so that Unicode
1107  character properties are used. This is achieved by replacing the POSIX classes  character properties are used. This is achieved by replacing the POSIX classes
1108  by other sequences, as follows:  by other sequences, as follows:
1109  .sp  .sp
1110    [:alnum:]  becomes  \ep{Xan}    [:alnum:]  becomes  \ep{Xan}
1111    [:alpha:]  becomes  \ep{L}    [:alpha:]  becomes  \ep{L}
1112    [:blank:]  becomes  \eh    [:blank:]  becomes  \eh
1113    [:digit:]  becomes  \ep{Nd}    [:digit:]  becomes  \ep{Nd}
1114    [:lower:]  becomes  \ep{Ll}    [:lower:]  becomes  \ep{Ll}
1115    [:space:]  becomes  \ep{Xps}    [:space:]  becomes  \ep{Xps}
1116    [:upper:]  becomes  \ep{Lu}    [:upper:]  becomes  \ep{Lu}
1117    [:word:]   becomes  \ep{Xwd}    [:word:]   becomes  \ep{Xwd}
1118  .sp  .sp
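The replacement table above can be written down as a lookup. This is a minimal sketch in Python of the textual substitution only. The names `UCP_POSIX_REPLACEMENTS` and `rewrite_posix_classes` are hypothetical, PCRE performs the equivalent step internally during compilation, and Python's `re` module cannot compile the resulting \p{...} sequences. Classes not listed in the table, and negated forms, are left unchanged.

```python
# POSIX class name -> sequence used when PCRE_UCP is set, per the table above.
UCP_POSIX_REPLACEMENTS = {
    "[:alnum:]": r"\p{Xan}",
    "[:alpha:]": r"\p{L}",
    "[:blank:]": r"\h",
    "[:digit:]": r"\p{Nd}",
    "[:lower:]": r"\p{Ll}",
    "[:space:]": r"\p{Xps}",
    "[:upper:]": r"\p{Lu}",
    "[:word:]": r"\p{Xwd}",
}

def rewrite_posix_classes(pattern):
    """Return pattern with each listed POSIX class replaced by its UCP form."""
    for name, replacement in UCP_POSIX_REPLACEMENTS.items():
        pattern = pattern.replace(name, replacement)
    return pattern

print(rewrite_posix_classes("[[:alpha:][:digit:]]"))
```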
# Line 2630  matching name is found, normal "bumpalon Line 2631  matching name is found, normal "bumpalon
2631  .sp  .sp
2632    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2633  .sp  .sp
2634  This verb causes a skip to the next alternation if the rest of the pattern does  This verb causes a skip to the next alternation in the innermost enclosing
2635  not match. That is, it cancels pending backtracking, but only within the  group if the rest of the pattern does not match. That is, it cancels pending
2636  current alternation. Its name comes from the observation that it can be used  backtracking, but only within the current alternation. Its name comes from the
2637  for a pattern-based if-then-else block:  observation that it can be used for a pattern-based if-then-else block:
2638  .sp  .sp
2639    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2640  .sp  .sp
# Line 2644  behaviour of (*THEN:NAME) is exactly the Line 2645  behaviour of (*THEN:NAME) is exactly the
2645  overall match fails. If (*THEN) is not directly inside an alternation, it acts  overall match fails. If (*THEN) is not directly inside an alternation, it acts
2646  like (*PRUNE).  like (*PRUNE).
2647  .  .
2648    .P
2649    The above verbs provide four different "strengths" of control when subsequent
2650    matching fails. (*THEN) is the weakest, carrying on the match at the next
2651    alternation. (*PRUNE) comes next, failing the match at the current starting
2652    position, but allowing an advance to the next character (for an unanchored
2653    pattern). (*SKIP) is similar, except that the advance may be more than one
2654    character. (*COMMIT) is the strongest, causing the entire match to fail.
2655    .P
2656    If more than one is present in a pattern, the "strongest" one wins. For example,
2657    consider this pattern, where A, B, etc. are complex pattern fragments:
2658    .sp
2659      (A(*COMMIT)B(*THEN)C|D)
2660    .sp
2661    Once A has matched, PCRE is committed to this match, at the current starting
2662    position. If subsequently B matches, but C does not, the normal (*THEN) action
2663    of trying the next alternation (that is, D) does not happen because (*COMMIT)
2664    overrides.
2665    .
2666  .  .
2667  .SH "SEE ALSO"  .SH "SEE ALSO"
2668  .rs  .rs
# Line 2666  Cambridge CB2 3QH, England. Line 2685  Cambridge CB2 3QH, England.
2685  .rs  .rs
2686  .sp  .sp
2687  .nf  .nf
2688  Last updated: 18 May 2010  Last updated: 26 October 2010
2689  Copyright (c) 1997-2010 University of Cambridge.  Copyright (c) 1997-2010 University of Cambridge.
2690  .fi  .fi
