# Diff of /code/trunk/doc/html/pcrepattern.html

revision 77 by nigel, Sat Feb 24 21:40:45 2007 UTC revision 87 by nigel, Sat Feb 24 21:41:21 2007 UTC
# Line 175  represents: Line 175  represents:
175    \t        tab (hex 09)    \t        tab (hex 09)
176    \ddd      character with octal code ddd, or backreference    \ddd      character with octal code ddd, or backreference
177    \xhh      character with hex code hh    \xhh      character with hex code hh
178    \x{hhh..} character with hex code hhh... (UTF-8 mode only)    \x{hhh..} character with hex code hhh..
179  </pre>  </pre>
180  The precise effect of \cx is as follows: if x is a lower case letter, it  The precise effect of \cx is as follows: if x is a lower case letter, it
181  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 184  Thus \cz becomes hex 1A, but \c{ becomes Line 184  Thus \cz becomes hex 1A, but \c{ becomes
184  </P>  </P>
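The \cx rule described above (fold a lower case letter to upper case, then invert bit 6) can be sketched in a few lines of Python. This is only an illustration, not PCRE's implementation; the function name `control_escape` and the restriction to a single ASCII character are assumptions made for the example.

```python
def control_escape(x: str) -> int:
    """Return the character code produced by \\cx under the rule above."""
    # Assumption for this sketch: x is exactly one ASCII character.
    if len(x) != 1 or ord(x) > 0x7F:
        raise ValueError("expected a single ASCII character")
    # Lower case letters are converted to upper case first.
    if "a" <= x <= "z":
        x = x.upper()
    # Then bit 6 (hex 40) of the character code is inverted.
    return ord(x) ^ 0x40


print(hex(control_escape("z")))  # 0x1a
print(hex(control_escape("{")))  # 0x3b
```

Because the operation is a plain XOR with hex 40, it works in both directions: a character at or above hex 40 drops into the control range, and one below hex 40 moves up.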
185  <P>  <P>
186  After \x, from zero to two hexadecimal digits are read (letters can be in  After \x, from zero to two hexadecimal digits are read (letters can be in
187  upper or lower case). In UTF-8 mode, any number of hexadecimal digits may  upper or lower case). Any number of hexadecimal digits may appear between \x{
188  appear between \x{ and }, but the value of the character code must be less  and }, but the value of the character code must be less than 256 in non-UTF-8
189  than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters  mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value
190  other than hexadecimal digits appear between \x{ and }, or if there is no  is 7FFFFFFF). If characters other than hexadecimal digits appear between \x{
191  terminating }, this form of escape is not recognized. Instead, the initial  and }, or if there is no terminating }, this form of escape is not recognized.
192  \x will be interpreted as a basic hexadecimal escape, with no following  Instead, the initial \x will be interpreted as a basic hexadecimal escape,
193  digits, giving a character whose value is zero.  with no following digits, giving a character whose value is zero.
194  </P>  </P>
195  <P>  <P>
196  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
197  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference in the  syntaxes for \x. There is no difference in the way they are handled. For
198  way they are handled. For example, \xdc is exactly the same as \x{dc}.  example, \xdc is exactly the same as \x{dc}.
199  </P>  </P>
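The revised \x rules can be read as a small parsing procedure. The Python sketch below follows the right-hand (revision 87) wording; it is not taken from the PCRE source. The name `parse_x_escape`, the use of `ValueError` for an out-of-range value, and the treatment of empty braces as "not recognized" are all assumptions of this example.

```python
from typing import Tuple

HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_x_escape(pattern: str, pos: int, utf8: bool = False) -> Tuple[int, int]:
    """Parse what follows a backslash-x escape.

    `pos` is the index just after the 'x'. Returns (character code,
    index of the first character that was not consumed).
    """
    if pos < len(pattern) and pattern[pos] == "{":
        close = pattern.find("}", pos + 1)
        digits = pattern[pos + 1:close] if close != -1 else ""
        # Assumption: empty braces are treated like a malformed escape.
        if digits and all(c in HEX_DIGITS for c in digits):
            value = int(digits, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                # Assumption: the sketch reports the range error this way.
                raise ValueError("character code too large for this mode")
            return value, close + 1
        # Braced form not recognized: \x stands alone with no digits,
        # giving the value zero, and the '{' is left unconsumed.
        return 0, pos
    # Basic form: from zero to two hexadecimal digits.
    end = pos
    while end < len(pattern) and end - pos < 2 and pattern[end] in HEX_DIGITS:
        end += 1
    value = int(pattern[pos:end], 16) if end > pos else 0
    return value, end


print(parse_x_escape("dc", 0))    # basic two-digit form
print(parse_x_escape("{dc}", 0))  # braced form, same character code
```

Both calls yield the code hex DC, which mirrors the statement that \xdc and \x{dc} are handled identically.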
200  <P>  <P>
201  After \0 up to two further octal digits are read. In both cases, if there  After \0 up to two further octal digits are read. In both cases, if there
# Line 285  greater than 128 are used for accented l Line 285  greater than 128 are used for accented l
285  <P>  <P>
286  In UTF-8 mode, characters with values greater than 128 never match \d, \s, or  In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
287  \w, and always match \D, \S, and \W. This is true even when Unicode  \w, and always match \D, \S, and \W. This is true even when Unicode
288  character property support is available.  character property support is available. The use of locales with Unicode is
289    discouraged.
290  <a name="uniextseq"></a></P>  <a name="uniextseq"></a></P>
291  <br><b>  <br><b>
292  Unicode character properties  Unicode character properties
293  </b><br>  </b><br>
294  <P>  <P>
295  When PCRE is built with Unicode character property support, three additional  When PCRE is built with Unicode character property support, three additional
296  escape sequences to match generic character types are available when UTF-8 mode  escape sequences to match character properties are available when UTF-8 mode
297  is selected. They are:  is selected. They are:
298  <pre>  <pre>
299   \p{<i>xx</i>}   a character with the <i>xx</i> property    \p{<i>xx</i>}   a character with the <i>xx</i> property
300   \P{<i>xx</i>}   a character without the <i>xx</i> property    \P{<i>xx</i>}   a character without the <i>xx</i> property
301   \X       an extended Unicode sequence    \X       an extended Unicode sequence
302  </pre>  </pre>
303  The property names represented by <i>xx</i> above are limited to the  The property names represented by <i>xx</i> above are limited to the Unicode
304  Unicode general category properties. Each character has exactly one such  script names, the general category properties, and "Any", which matches any
305  property, specified by a two-letter abbreviation. For compatibility with Perl,  character (including newline). Other properties such as "InMusicalSymbols" are
306  negation can be specified by including a circumflex between the opening brace  not currently supported by PCRE. Note that \P{Any} does not match any
307  and the property name. For example, \p{^Lu} is the same as \P{Lu}.  characters, so always causes a match failure.
308  </P>  </P>
309  <P>  <P>
310  If only one letter is specified with \p or \P, it includes all the properties  Sets of Unicode characters are defined as belonging to certain scripts. A
311  that start with that letter. In this case, in the absence of negation, the  character from one of these sets can be matched using a script name. For
312  curly brackets in the escape sequence are optional; these two examples have  example:
313  the same effect:  <pre>
314      \p{Greek}
315      \P{Han}
316    </pre>
317    Those that are not part of an identified script are lumped together as
318    "Common". The current list of scripts is:
319    </P>
320    <P>
321    Arabic,
322    Armenian,
323    Bengali,
324    Bopomofo,
325    Braille,
326    Buginese,
327    Buhid,
328    Canadian_Aboriginal,
329    Cherokee,
330    Common,
331    Coptic,
332    Cypriot,
333    Cyrillic,
334    Deseret,
335    Devanagari,
336    Ethiopic,
337    Georgian,
338    Glagolitic,
339    Gothic,
340    Greek,
341    Gujarati,
342    Gurmukhi,
343    Han,
344    Hangul,
345    Hanunoo,
346    Hebrew,
347    Hiragana,
348    Inherited,
349    Kannada,
350    Katakana,
351    Kharoshthi,
352    Khmer,
353    Lao,
354    Latin,
355    Limbu,
356    Linear_B,
357    Malayalam,
358    Mongolian,
359    Myanmar,
360    New_Tai_Lue,
361    Ogham,
362    Old_Italic,
363    Old_Persian,
364    Oriya,
365    Osmanya,
366    Runic,
367    Shavian,
368    Sinhala,
369    Syloti_Nagri,
370    Syriac,
371    Tagalog,
372    Tagbanwa,
373    Tai_Le,
374    Tamil,
375    Telugu,
376    Thaana,
377    Thai,
378    Tibetan,
379    Tifinagh,
380    Ugaritic,
381    Yi.
382    </P>
383    <P>
384    Each character has exactly one general category property, specified by a
385    two-letter abbreviation. For compatibility with Perl, negation can be specified
386    by including a circumflex between the opening brace and the property name. For
387    example, \p{^Lu} is the same as \P{Lu}.
388    </P>
389    <P>
390    If only one letter is specified with \p or \P, it includes all the general
391    category properties that start with that letter. In this case, in the absence
392    of negation, the curly brackets in the escape sequence are optional; these two
393    examples have the same effect:
394  <pre>  <pre>
395    \p{L}    \p{L}
396    \pL    \pL
397  </pre>  </pre>
398  The following property codes are supported:  The following general category property codes are supported:
399  <pre>  <pre>
400    C     Other    C     Other
401    Cc    Control    Cc    Control
# Line 360  The following property codes are support Line 441  The following property codes are support
441    Zp    Paragraph separator    Zp    Paragraph separator
442    Zs    Space separator    Zs    Space separator
443  </pre>  </pre>
444  Extended properties such as "Greek" or "InMusicalSymbols" are not supported by  The special property L& is also supported: it matches a character that has
445  PCRE.  the Lu, Ll, or Lt property, in other words, a letter that is not classified as
446    a modifier or "other".
447    </P>
448    <P>
449    The long synonyms for these properties that Perl supports (such as \p{Letter})
450    are not supported by PCRE. Nor is it permitted to prefix any of these
451    properties with "Is".
452    </P>
453    <P>
454    No character that is in the Unicode table has the Cn (unassigned) property.
455    Instead, this property is assumed for any code point that is not in the
456    Unicode table.
457  </P>  </P>
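The general category rules above (two-letter codes, one-letter prefixes, the ^ negation, L&, and Cn for code points outside the table) can be approximated with Python's standard `unicodedata` module. This is a hedged sketch, not PCRE's lookup: the helper names `matches_p` and `matches_upper_p` are hypothetical, and Python's Unicode tables are a different version from the ones PCRE ships, so individual characters may be classified differently.

```python
import unicodedata


def matches_p(ch: str, prop: str) -> bool:
    """Approximate \\p{prop} for general category properties only."""
    negated = prop.startswith("^")      # \p{^Lu} is the same as \P{Lu}
    name = prop[1:] if negated else prop
    category = unicodedata.category(ch)  # e.g. "Lu", "Nd", "Cn"
    if name == "L&":
        # L& matches a character with the Lu, Ll, or Lt property.
        result = category in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        # A single letter covers every category starting with it.
        result = category.startswith(name)
    else:
        result = category == name
    return result != negated


def matches_upper_p(ch: str, prop: str) -> bool:
    """Approximate \\P{prop}: the negation of \\p{prop}."""
    return not matches_p(ch, prop)


print(matches_p("A", "Lu"), matches_p("A", "^Lu"), matches_upper_p("A", "Lu"))
```

Script names such as Greek or Han are deliberately left out, since `unicodedata` does not expose the script property.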
458  <P>  <P>
459  Specifying caseless matching does not affect these escape sequences. For  Specifying caseless matching does not affect these escape sequences. For
# Line 1360  number, provided that it occurs inside t Line 1452  number, provided that it occurs inside t
1452  (?R) is a recursive call of the entire regular expression.  (?R) is a recursive call of the entire regular expression.
1453  </P>  </P>
1454  <P>  <P>
1455  For example, this PCRE pattern solves the nested parentheses problem (assume  A recursive subpattern call is always treated as an atomic group. That is, once
1456  the PCRE_EXTENDED option is set so that white space is ignored):  it has matched some of the subject string, it is never re-entered, even if
1457    it contains untried alternatives and there is a subsequent matching failure.
1458    </P>
1459    <P>
1460    This PCRE pattern solves the nested parentheses problem (assume the
1461    PCRE_EXTENDED option is set so that white space is ignored):
1462  <pre>  <pre>
1463    \( ( (?&#62;[^()]+) | (?R) )* \)    \( ( (?&#62;[^()]+) | (?R) )* \)
1464  </pre>  </pre>
1465  First it matches an opening parenthesis. Then it matches any number of  First it matches an opening parenthesis. Then it matches any number of
1466  substrings which can either be a sequence of non-parentheses, or a recursive  substrings which can either be a sequence of non-parentheses, or a recursive
1467  match of the pattern itself (that is a correctly parenthesized substring).  match of the pattern itself (that is, a correctly parenthesized substring).
1468  Finally there is a closing parenthesis.  Finally there is a closing parenthesis.
1469  </P>  </P>
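Python's standard `re` module has no (?R), so the sketch below mirrors the structure of the pattern with a hand-written recursive function instead of a regular expression. It is an illustration of the idea only; the name `match_parens` is hypothetical, and the function matches at a given position rather than searching the subject as PCRE would.

```python
from typing import Optional


def match_parens(s: str, pos: int = 0) -> Optional[int]:
    """Return the index just past a balanced group starting at pos, or None."""
    # \(  : the match must begin with an opening parenthesis.
    if pos >= len(s) or s[pos] != "(":
        return None
    i = pos + 1
    while i < len(s):
        if s[i] == ")":
            # \)  : closing parenthesis ends this level.
            return i + 1
        if s[i] == "(":
            # (?R) : recursive match of the whole pattern.
            end = match_parens(s, i)
            if end is None:
                return None
            i = end
        else:
            # (?>[^()]+) : consume a run of non-parentheses.
            while i < len(s) and s[i] not in "()":
                i += 1
    return None


print(match_parens("(ab(cd)ef)"))  # 10
print(match_parens("(()"))         # None
```

Once a recursive call has returned, the function never goes back into it to try something else, which corresponds to the atomic treatment of recursive calls described above.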
1470  <P>  <P>
# Line 1450  is used, it does match "sense and respon Line 1547  is used, it does match "sense and respon
1547  strings. Such references must, however, follow the subpattern to which they  strings. Such references must, however, follow the subpattern to which they
1548  refer.  refer.
1549  </P>  </P>
1550    <P>
1551    Like recursive subpatterns, a "subroutine" call is always treated as an atomic
1552    group. That is, once it has matched some of the subject string, it is never
1553    re-entered, even if it contains untried alternatives and there is a subsequent
1554    matching failure.
1555    </P>
1556  <br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>  <br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>
1557  <P>  <P>
1558  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
# Line 1486  description of the interface to the call Line 1589  description of the interface to the call
1589  documentation.  documentation.
1590  </P>  </P>
1591  <P>  <P>
1592  Last updated: 28 February 2005  Last updated: 24 January 2006
1593  <br>  <br>
1594  Copyright &copy; 1997-2005 University of Cambridge.  Copyright &copy; 1997-2006 University of Cambridge.
1595  <p>  <p>