Diff of /code/trunk/doc/html/pcrepattern.html

revision 517 by ph10, Tue Mar 30 11:11:52 2010 UTC revision 518 by ph10, Tue May 18 15:47:01 2010 UTC
# Line 18  man page, in case the conversion went wr Line 18  man page, in case the conversion went wr
18  <li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a>  <li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a>
19  <li><a name="TOC4" href="#SEC4">BACKSLASH</a>  <li><a name="TOC4" href="#SEC4">BACKSLASH</a>
20  <li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a>  <li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a>
21  <li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT)</a>  <li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a>
22  <li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a>  <li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a>
23  <li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a>  <li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a>
24  <li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a>  <li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a>
# Line 124  they must be in upper case. If more than Line 124  they must be in upper case. If more than
124  is used.  is used.
125  </P>  </P>
126  <P>  <P>
127  The newline convention does not affect what the \R escape sequence matches. By  The newline convention affects the interpretation of the dot metacharacter when
128  default, this is any Unicode newline sequence, for Perl compatibility. However,  PCRE_DOTALL is not set, and also the behaviour of \N. However, it does not
129  this can be changed; see the description of \R in the section entitled  affect what the \R escape sequence matches. By default, this is any Unicode
130    newline sequence, for Perl compatibility. However, this can be changed; see the
131    description of \R in the section entitled
132  <a href="#newlineseq">"Newline sequences"</a>  <a href="#newlineseq">"Newline sequences"</a>
133  below. A change of \R setting can be combined with a change of newline  below. A change of \R setting can be combined with a change of newline
134  convention.  convention.
# Line 308  zero, because no more than three octal d Line 310  zero, because no more than three octal d
310  <P>  <P>
311  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
312  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, the
313  sequence \b is interpreted as the backspace character (hex 08), and the  sequence \b is interpreted as the backspace character (hex 08). The sequences
314  sequences \R and \X are interpreted as the characters "R" and "X",  \B, \N, \R, and \X are not special inside a character class. Like any other
315  respectively. Outside a character class, these sequences have different  unrecognized escape sequences, they are treated as the literal characters "B",
316  meanings  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is
317  <a href="#uniextseq">(see below).</a>  set. Outside a character class, these sequences have different meanings.
318  </P>  </P>
319  <br><b>  <br><b>
320  Absolute and relative back references  Absolute and relative back references
# Line 337  Note that \g{...} (Perl syntax) and \g&# Line 339  Note that \g{...} (Perl syntax) and \g&#
339  synonymous. The former is a back reference; the latter is a  synonymous. The former is a back reference; the latter is a
340  <a href="#subpatternsassubroutines">subroutine</a>  <a href="#subpatternsassubroutines">subroutine</a>
341  call.  call.
342  </P>  <a name="genericchartypes"></a></P>
343  <br><b>  <br><b>
344  Generic character types  Generic character types
345  </b><br>  </b><br>
346  <P>  <P>
347  Another use of backslash is for specifying generic character types. The  Another use of backslash is for specifying generic character types:
 following are always recognized:  
348  <pre>  <pre>
349    \d     any decimal digit    \d     any decimal digit
350    \D     any character that is not a decimal digit    \D     any character that is not a decimal digit
# Line 356  following are always recognized: Line 357  following are always recognized:
357    \w     any "word" character    \w     any "word" character
358    \W     any "non-word" character    \W     any "non-word" character
359  </pre>  </pre>
360  Each pair of escape sequences partitions the complete set of characters into  There is also the single sequence \N, which matches a non-newline character.
361  two disjoint sets. Any given character matches one, and only one, of each pair.  This is the same as
362    <a href="#fullstopdot">the "." metacharacter</a>
363    when PCRE_DOTALL is not set.
364    </P>
365    <P>
366    Each pair of lower and upper case escape sequences partitions the complete set
367    of characters into two disjoint sets. Any given character matches one, and only
368    one, of each pair.
369  </P>  </P>
370  <P>  <P>
371  These character type sequences can appear both inside and outside character  These character type sequences can appear both inside and outside character
# Line 475  convention, for example, a pattern can s Line 483  convention, for example, a pattern can s
483  <pre>  <pre>
484    (*ANY)(*BSR_ANYCRLF)    (*ANY)(*BSR_ANYCRLF)
485  </pre>  </pre>
486  Inside a character class, \R matches the letter "R".  Inside a character class, \R is treated as an unrecognized escape sequence,
487    and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is
488    set.
489  <a name="uniextseq"></a></P>  <a name="uniextseq"></a></P>
490  <br><b>  <br><b>
491  Unicode character properties  Unicode character properties
# Line 492  The extra escape sequences are: Line 502  The extra escape sequences are:
502    \X       an extended Unicode sequence    \X       an extended Unicode sequence
503  </pre>  </pre>
504  The property names represented by <i>xx</i> above are limited to the Unicode  The property names represented by <i>xx</i> above are limited to the Unicode
505  script names, the general category properties, and "Any", which matches any  script names, the general category properties, "Any", which matches any
506  character (including newline). Other properties such as "InMusicalSymbols" are  character (including newline), and some special PCRE properties (described
507  not currently supported by PCRE. Note that \P{Any} does not match any  in the
508  characters, so always causes a match failure.  <a href="#extraprops">next section).</a>
509    Other Perl properties such as "InMusicalSymbols" are not currently supported by
510    PCRE. Note that \P{Any} does not match any characters, so always causes a
511    match failure.
512  </P>  </P>
513  <P>  <P>
514  Sets of Unicode characters are defined as belonging to certain scripts. A  Sets of Unicode characters are defined as belonging to certain scripts. A
# Line 603  Vai, Line 616  Vai,
616  Yi.  Yi.
617  </P>  </P>
618  <P>  <P>
619  Each character has exactly one general category property, specified by a  Each character has exactly one Unicode general category property, specified by
620  two-letter abbreviation. For compatibility with Perl, negation can be specified  a two-letter abbreviation. For compatibility with Perl, negation can be
621  by including a circumflex between the opening brace and the property name. For  specified by including a circumflex between the opening brace and the property
622  example, \p{^Lu} is the same as \P{Lu}.  name. For example, \p{^Lu} is the same as \P{Lu}.
623  </P>  </P>
624  <P>  <P>
625  If only one letter is specified with \p or \P, it includes all the general  If only one letter is specified with \p or \P, it includes all the general
# Line 708  Matching characters by Unicode property Line 721  Matching characters by Unicode property
721  a structure that contains data for over fifteen thousand characters. That is  a structure that contains data for over fifteen thousand characters. That is
722  why the traditional escape sequences such as \d and \w do not use Unicode  why the traditional escape sequences such as \d and \w do not use Unicode
723  properties in PCRE.  properties in PCRE.
724    <a name="extraprops"></a></P>
725    <br><b>
726    PCRE's additional properties
727    </b><br>
728    <P>
729    As well as the standard Unicode properties described in the previous
730    section, PCRE supports four more that make it possible to convert traditional
731    escape sequences such as \w and \s and POSIX character classes to use Unicode
732    properties. These are:
733    <pre>
734      Xan   Any alphanumeric character
735      Xps   Any POSIX space character
736      Xsp   Any Perl space character
737      Xwd   Any Perl "word" character
738    </pre>
739    Xan matches characters that have either the L (letter) or the N (number)
740    property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
741    carriage return, and any other character that has the Z (separator) property.
742    Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
743    same characters as Xan, plus underscore.
744  <a name="resetmatchstart"></a></P>  <a name="resetmatchstart"></a></P>
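The four additional properties added in this revision (Xan, Xps, Xsp, Xwd) are defined purely in terms of Unicode general categories plus a few listed control characters, so the definitions can be sketched outside PCRE. The following is a minimal illustration in Python using the standard `unicodedata` module; the function names are hypothetical, this is not PCRE's implementation, and the Unicode tables bundled with Python may differ in version from those PCRE was built with.

```python
import unicodedata

# Control characters that the text lists explicitly for Xps:
# tab, linefeed, vertical tab, formfeed, carriage return.
_POSIX_SPACE_CONTROLS = {"\t", "\n", "\x0b", "\x0c", "\r"}


def is_xan(ch: str) -> bool:
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch: str) -> bool:
    """Xps: a listed control character, or any Z (separator) character."""
    return ch in _POSIX_SPACE_CONTROLS or unicodedata.category(ch)[0] == "Z"


def is_xsp(ch: str) -> bool:
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and is_xps(ch)


def is_xwd(ch: str) -> bool:
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)
```

For instance, under these definitions vertical tab satisfies `is_xps` but not `is_xsp`, and underscore satisfies `is_xwd` but not `is_xan`.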
745  <br><b>  <br><b>
746  Resetting the match start  Resetting the match start
# Line 756  The backslashed assertions are: Line 789  The backslashed assertions are:
789    \z     matches only at the end of the subject    \z     matches only at the end of the subject
790    \G     matches at the first matching position in the subject    \G     matches at the first matching position in the subject
791  </pre>  </pre>
792  These assertions may not appear in character classes (but note that \b has a  Inside a character class, \b has a different meaning; it matches the backspace
793  different meaning, namely the backspace character, inside a character class).  character. If any other of these assertions appears in a character class, by
794    default it matches the corresponding literal character (for example, \B
795    matches the letter B). However, if the PCRE_EXTRA option is set, an "invalid
796    escape sequence" error is generated instead.
797  </P>  </P>
798  <P>  <P>
799  A word boundary is a position in the subject string where the current character  A word boundary is a position in the subject string where the current character
# Line 853  PCRE_DOLLAR_ENDONLY option is ignored if Line 889  PCRE_DOLLAR_ENDONLY option is ignored if
889  Note that the sequences \A, \Z, and \z can be used to match the start and  Note that the sequences \A, \Z, and \z can be used to match the start and
890  end of the subject in both modes, and if all branches of a pattern start with  end of the subject in both modes, and if all branches of a pattern start with
891  \A it is always anchored, whether or not PCRE_MULTILINE is set.  \A it is always anchored, whether or not PCRE_MULTILINE is set.
892  </P>  <a name="fullstopdot"></a></P>
893  <br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>  <br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br>
894  <P>  <P>
895  Outside a character class, a dot in the pattern matches any one character in  Outside a character class, a dot in the pattern matches any one character in
896  the subject string except (by default) a character that signifies the end of a  the subject string except (by default) a character that signifies the end of a
# Line 879  The handling of dot is entirely independ Line 915  The handling of dot is entirely independ
915  dollar, the only relationship being that they both involve newlines. Dot has no  dollar, the only relationship being that they both involve newlines. Dot has no
916  special meaning in a character class.  special meaning in a character class.
917  </P>  </P>
918    <P>
919    The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not
920    set. In other words, it matches any one character except one that signifies the
921    end of a line.
922    </P>
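The relationship between dot, PCRE_DOTALL, and \N described above can be illustrated by analogy. The sketch below uses Python's `re` module, which does not give \N this meaning, so a negated class `[^\n]` stands in for \N under an assumed LF-only newline convention; the variable names are illustrative only.

```python
import re

# Stand-in for PCRE's \N, assuming the newline convention is LF only.
NON_NEWLINE = r"[^\n]"

subject = "ab\ncd"

# Dot without DOTALL skips the newline.
dot_default = re.findall(r".", subject)

# Dot with DOTALL matches the newline as well.
dot_dotall = re.findall(r".", subject, flags=re.DOTALL)

# The \N stand-in ignores DOTALL and still skips the newline.
non_newline = re.findall(NON_NEWLINE, subject, flags=re.DOTALL)
```

The stand-in gives the same matches as dot without DOTALL, whether or not the DOTALL flag is set, which is the behaviour the text attributes to \N.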
923  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
924  <P>  <P>
925  Outside a character class, the escape sequence \C matches any one byte, both  Outside a character class, the escape sequence \C matches any one byte, both
# Line 2548  Cambridge CB2 3QH, England. Line 2589  Cambridge CB2 3QH, England.
2589  </P>  </P>
2590  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2591  <P>  <P>
2592  Last updated: 27 March 2010  Last updated: 05 May 2010
2593  <br>  <br>
2594  Copyright &copy; 1997-2010 University of Cambridge.  Copyright &copy; 1997-2010 University of Cambridge.
2595  <br>  <br>
