Diff of /code/trunk/doc/pcre.txt

revision 659 by ph10, Tue Aug 16 09:48:29 2011 UTC revision 678 by ph10, Sun Aug 28 15:23:03 2011 UTC
# Line 85  USER DOCUMENTATION Line 85  USER DOCUMENTATION
85           pcrecpp           details of the C++ wrapper           pcrecpp           details of the C++ wrapper
86           pcredemo          a demonstration C program that uses PCRE           pcredemo          a demonstration C program that uses PCRE
87           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
88             pcrejit           discussion of the just-in-time optimization support
89             pcrelimits        details of size and other limits
90           pcrematching      discussion of the two matching algorithms           pcrematching      discussion of the two matching algorithms
91           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
92           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
# Line 96  USER DOCUMENTATION Line 98  USER DOCUMENTATION
98           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
99           pcresyntax        quick syntax reference           pcresyntax        quick syntax reference
100           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
101             pcreunicode       discussion of Unicode and UTF-8 support
102    
103         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
104         each C library function, listing its arguments and results.         each C library function, listing its arguments and results.
105    
106    
 LIMITATIONS  
   
        There are some size limitations in PCRE but it is hoped that they  will  
        never in practice be relevant.  
   
        The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE  
        is compiled with the default internal linkage size of 2. If you want to  
        process  regular  expressions  that are truly enormous, you can compile  
        PCRE with an internal linkage size of 3 or 4 (see the  README  file  in  
        the  source  distribution and the pcrebuild documentation for details).  
        In these cases the limit is substantially larger.  However,  the  speed  
        of execution is slower.  
   
        All values in repeating quantifiers must be less than 65536.  
   
        There is no limit to the number of parenthesized subpatterns, but there  
        can be no more than 65535 capturing subpatterns.  
   
        The maximum length of name for a named subpattern is 32 characters, and  
        the maximum number of named subpatterns is 10000.  
   
        The  maximum  length of a subject string is the largest positive number  
        that an integer variable can hold. However, when using the  traditional  
        matching function, PCRE uses recursion to handle subpatterns and indef-  
        inite repetition.  This means that the available stack space may  limit  
        the size of a subject string that can be processed by certain patterns.  
        For a discussion of stack issues, see the pcrestack documentation.  
   
   
 UTF-8 AND UNICODE PROPERTY SUPPORT  
   
        From release 3.3, PCRE has  had  some  support  for  character  strings  
        encoded  in the UTF-8 format. For release 4.0 this was greatly extended  
        to cover most common requirements, and in release 5.0  additional  sup-  
        port for Unicode general category properties was added.  
   
        In order to process UTF-8 strings, you must build PCRE to include UTF-8
        support in the code, and, in addition,  you  must  call  pcre_compile()  
        with  the  PCRE_UTF8  option  flag,  or the pattern must start with the  
        sequence (*UTF8). When either of these is the case,  both  the  pattern  
        and  any  subject  strings  that  are matched against it are treated as  
        UTF-8 strings instead of strings of 1-byte characters.  
   
        If you compile PCRE with UTF-8 support, but do not use it at run  time,  
        the  library will be a bit bigger, but the additional run time overhead  
        is limited to testing the PCRE_UTF8 flag occasionally, so should not be  
        very big.  
   
        If PCRE is built with Unicode character property support (which implies  
        UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-  
        ported.  The available properties that can be tested are limited to the  
        general category properties such as Lu for an upper case letter  or  Nd  
        for  a  decimal number, the Unicode script names such as Arabic or Han,  
        and the derived properties Any and L&. A full  list  is  given  in  the  
        pcrepattern documentation. Only the short names for properties are sup-  
        ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-  
        ter},  is  not  supported.   Furthermore,  in Perl, many properties may  
        optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE  
        does not support this.  
   
    Validity of UTF-8 strings  
   
        When  you  set  the  PCRE_UTF8 flag, the strings passed as patterns and  
        subjects are (by default) checked for validity on entry to the relevant  
        functions. From release 7.3 of PCRE, the check is according to the rules
        of RFC 3629, which are themselves derived from the  Unicode  specifica-  
        tion.  Earlier  releases  of PCRE followed the rules of RFC 2279, which  
        allows the full range of 31-bit values (0 to 0x7FFFFFFF).  The  current  
        check allows only values in the range U+0 to U+10FFFF, excluding U+D800  
        to U+DFFF.  
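
        The range rule above can be illustrated with a small standalone
        checker. This is only a sketch under stated assumptions: the
        function name first_invalid_utf8() is hypothetical and is not part
        of the PCRE API, and the code does not reproduce PCRE's internal
        check. It applies the RFC 3629 constraints described above (values
        U+0 to U+10FFFF, no surrogates U+D800 to U+DFFF) and also rejects
        overlong encodings, returning the offset of the first byte of the
        failing character.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns the byte offset of the
   first invalid character in s[0..len), or -1 if the whole buffer is
   valid UTF-8 under the RFC 3629 rules. */
long first_invalid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* number of continuation bytes expected */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest value allowed for this length */
        size_t k;

        if (c < 0x80) { i++; continue; }   /* plain ASCII byte */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;   /* stray continuation or 5/6-byte lead byte */

        if (len - i <= extra) return (long)i;   /* truncated character */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min) return (long)i;                       /* overlong form */
        if (cp > 0x10FFFF) return (long)i;                  /* beyond Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;   /* surrogate */
        i += extra + 1;
    }
    return -1;
}
```

        A five-byte sequence of the kind RFC 2279 allowed is rejected at its
        lead byte by this sketch, which mirrors the difference between the
        two RFCs that the paragraph above describes.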
   
        The excluded code points are the "Low Surrogate Area"  of  Unicode,  of  
        which  the Unicode Standard says this: "The Low Surrogate Area does not  
        contain any  character  assignments,  consequently  no  character  code  
        charts or namelists are provided for this area. Surrogates are reserved  
        for use with UTF-16 and then must be used in pairs."  The  code  points  
        that  are  encoded  by  UTF-16  pairs are available as independent code  
        points in the UTF-8 encoding. (In  other  words,  the  whole  surrogate  
        thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)  
   
        If an invalid UTF-8 string is passed to PCRE, an error return is given.  
        At compile time, the only additional information is the offset  to  the  
        first byte of the failing character. The runtime functions (pcre_exec()  
        and pcre_dfa_exec()), pass back this information  as  well  as  a  more  
        detailed  reason  code if the caller has provided memory in which to do  
        this.  
   
        In some situations, you may already know that your strings  are  valid,  
        and  therefore  want  to  skip these checks in order to improve perfor-  
        mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run  
        time,  PCRE  assumes  that  the pattern or subject it is given (respec-  
        tively) contains only valid UTF-8 codes. In  this  case,  it  does  not  
        diagnose an invalid UTF-8 string.  
   
        If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,  
        what happens depends on why the string is invalid. If the  string  con-  
        forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a  
        string of characters in the range 0  to  0x7FFFFFFF.  In  other  words,  
        apart from the initial validity test, PCRE (when in UTF-8 mode) handles  
        strings according to the more liberal rules of RFC  2279.  However,  if  
        the  string does not even conform to RFC 2279, the result is undefined.  
        Your program may crash.  
   
        If you want to process strings  of  values  in  the  full  range  0  to  
        0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can  
        set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in  
        this situation, you will have to apply your own validity check.  
   
    General comments about UTF-8 mode  
   
        1.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a  
        two-byte UTF-8 character if the value is greater than 127.  
   
        2. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8  
        characters for values greater than \177.  
   
        3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-  
        vidual bytes, for example: \x{100}{3}.  
   
        4. The dot metacharacter matches one UTF-8 character instead of a  sin-  
        gle byte.  
   
        5.  The  escape sequence \C can be used to match a single byte in UTF-8  
        mode, but its use can lead to some strange effects.  This  facility  is  
        not available in the alternative matching function, pcre_dfa_exec().  
   
        6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly  
        test characters of any code value, but, by default, the characters that  
        PCRE  recognizes  as digits, spaces, or word characters remain the same  
        set as before, all with values less than 256. This  remains  true  even  
        when  PCRE  is built to include Unicode property support, because to do  
        otherwise would slow down PCRE in many common cases. Note in particular  
        that this applies to \b and \B, because they are defined in terms of \w  
        and \W. If you really want to test for a wider sense of, say,  "digit",  
        you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-  
        tively, if you set the PCRE_UCP option,  the  way  that  the  character  
        escapes  work  is changed so that Unicode properties are used to deter-  
        mine which characters match. There are more details in the  section  on  
        generic character types in the pcrepattern documentation.  
   
        7.  Similarly,  characters that match the POSIX named character classes  
        are all low-valued characters, unless the PCRE_UCP option is set.  
   
        8. However, the horizontal and  vertical  whitespace  matching  escapes  
        (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,  
        whether or not PCRE_UCP is set.  
   
        9. Case-insensitive matching applies only to  characters  whose  values  
        are  less than 128, unless PCRE is built with Unicode property support.  
        Even when Unicode property support is available, PCRE  still  uses  its  
        own  character  tables when checking the case of low-valued characters,  
        so as not to degrade performance.  The Unicode property information  is  
        used only for characters with higher values. Furthermore, PCRE supports  
        case-insensitive matching only  when  there  is  a  one-to-one  mapping  
        between  a letter's cases. There are a small number of many-to-one map-  
        pings in Unicode; these are not supported by PCRE.  
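
        Items 1 to 3 of the list above depend on how a code point maps to
        UTF-8 bytes. The following sketch shows that mapping; the helper
        name utf8_encode() is hypothetical and is not a PCRE function. It
        illustrates why \xb3 (value 179) corresponds to the two bytes
        C2 B3 in UTF-8 mode, why \777 corresponds to C7 BF, and why
        \x{100}{3} repeats the whole two-byte sequence C4 80 rather than a
        single byte.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: writes the UTF-8 encoding of cp
   into out and returns the number of bytes written (1 to 4). Returns 0
   for surrogates and for values above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

        Values up to 127 (octal \177) stay as a single byte, which is why
        the two-byte behaviour in items 1 and 2 starts only above that
        value.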
   
   
107  AUTHOR  AUTHOR
108    
109         Philip Hazel         Philip Hazel
# Line 272  AUTHOR Line 117  AUTHOR
117    
118  REVISION  REVISION
119    
120         Last updated: 07 May 2011         Last updated: 24 August 2011
121         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
122  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
123    
124    
125  PCREBUILD(3)                                                      PCREBUILD(3)  PCREBUILD(3)                                                      PCREBUILD(3)
126    
127    
# Line 622  REVISION Line 467  REVISION
467         Last updated: 02 August 2011         Last updated: 02 August 2011
468         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
469  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
470    
471    
472  PCREMATCHING(3)                                                PCREMATCHING(3)  PCREMATCHING(3)                                                PCREMATCHING(3)
473    
474    
# Line 826  REVISION Line 671  REVISION
671         Last updated: 17 November 2010         Last updated: 17 November 2010
672         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
673  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
674    
675    
676  PCREAPI(3)                                                          PCREAPI(3)  PCREAPI(3)                                                          PCREAPI(3)
677    
678    
# Line 1453  COMPILING A PATTERN Line 1298  COMPILING A PATTERN
1298         strings  of  UTF-8 characters instead of single-byte character strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1299         However, it is available only when PCRE is built to include UTF-8  sup-         However, it is available only when PCRE is built to include UTF-8  sup-
1300         port.  If not, the use of this option provokes an error. Details of how         port.  If not, the use of this option provokes an error. Details of how
1301         this option changes the behaviour of PCRE are given in the  section  on         this option changes the behaviour of PCRE are given in the  pcreunicode
1302         UTF-8 support in the main pcre page.         page.
1303    
1304           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1305    
# Line 2998  REVISION Line 2843  REVISION
2843         Last updated: 13 August 2011         Last updated: 13 August 2011
2844         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
2845  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2846    
2847    
2848  PCRECALLOUT(3)                                                  PCRECALLOUT(3)  PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2849    
2850    
# Line 3187  REVISION Line 3032  REVISION
3032         Last updated: 31 July 2011         Last updated: 31 July 2011
3033         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3034  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3035    
3036    
3037  PCRECOMPAT(3)                                                    PCRECOMPAT(3)  PCRECOMPAT(3)                                                    PCRECOMPAT(3)
3038    
3039    
# Line 3203  DIFFERENCES BETWEEN PCRE AND PERL Line 3048  DIFFERENCES BETWEEN PCRE AND PERL
3048         respect to Perl versions 5.10 and above.         respect to Perl versions 5.10 and above.
3049    
3050         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
3051         of what it does have are given in the section on UTF-8 support  in  the         of what it does have are given in the pcreunicode page.
        main pcre page.  
3052    
3053         2. PCRE allows repeat quantifiers only on parenthesized assertions, but         2. PCRE allows repeat quantifiers only on parenthesized assertions, but
3054         they do not mean what you might think. For example, (?!a){3}  does  not         they  do  not mean what you might think. For example, (?!a){3} does not
3055         assert that the next three characters are not "a". It just asserts that         assert that the next three characters are not "a". It just asserts that
3056         the next character is not "a" three times (in principle: PCRE optimizes         the next character is not "a" three times (in principle: PCRE optimizes
3057         this to run the assertion just once). Perl allows repeat quantifiers on         this to run the assertion just once). Perl allows repeat quantifiers on
3058         other assertions such as \b, but these do not seem to have any use.         other assertions such as \b, but these do not seem to have any use.
3059    
3060         3. Capturing subpatterns that occur inside  negative  lookahead  asser-         3.  Capturing  subpatterns  that occur inside negative lookahead asser-
3061         tions  are  counted,  but their entries in the offsets vector are never         tions are counted, but their entries in the offsets  vector  are  never
3062         set. Perl sets its numerical variables from any such patterns that  are         set.  Perl sets its numerical variables from any such patterns that are
3063         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
3064         ing), but only if the negative lookahead assertion  contains  just  one         ing),  but  only  if the negative lookahead assertion contains just one
3065         branch.         branch.
3066    
3067         4.  Though  binary zero characters are supported in the subject string,         4. Though binary zero characters are supported in the  subject  string,
3068         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
3069         mal C string, terminated by zero. The escape sequence \0 can be used in         mal C string, terminated by zero. The escape sequence \0 can be used in
3070         the pattern to represent a binary zero.         the pattern to represent a binary zero.
3071    
3072         5. The following Perl escape sequences are not supported: \l,  \u,  \L,         5.  The  following Perl escape sequences are not supported: \l, \u, \L,
3073         \U,  and  \N when followed by a character name or Unicode value. (\N on         \U, and \N when followed by a character name or Unicode value.  (\N  on
3074         its own, matching a non-newline character, is supported.) In fact these         its own, matching a non-newline character, is supported.) In fact these
3075         are  implemented  by Perl's general string-handling and are not part of         are implemented by Perl's general string-handling and are not  part  of
3076         its pattern matching engine. If any of these are encountered  by  PCRE,         its  pattern  matching engine. If any of these are encountered by PCRE,
3077         an error is generated.         an error is generated.
3078    
3079         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
3080         is built with Unicode character property support. The  properties  that         is  built  with Unicode character property support. The properties that
3081         can  be tested with \p and \P are limited to the general category prop-         can be tested with \p and \P are limited to the general category  prop-
3082         erties such as Lu and Nd, script names such as Greek or  Han,  and  the         erties  such  as  Lu and Nd, script names such as Greek or Han, and the
3083         derived  properties  Any  and  L&. PCRE does support the Cs (surrogate)         derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
3084         property, which Perl does not; the  Perl  documentation  says  "Because         property,  which  Perl  does  not; the Perl documentation says "Because
3085         Perl hides the need for the user to understand the internal representa-         Perl hides the need for the user to understand the internal representa-
3086         tion of Unicode characters, there is no need to implement the  somewhat         tion  of Unicode characters, there is no need to implement the somewhat
3087         messy concept of surrogates."         messy concept of surrogates."
3088    
3089         7.  PCRE implements a simpler version of \X than Perl, which changed to         7. PCRE implements a simpler version of \X than Perl, which changed  to
3090         make \X match what Unicode calls an "extended grapheme  cluster".  This         make  \X  match what Unicode calls an "extended grapheme cluster". This
3091         is  more  complicated  than an extended Unicode sequence, which is what         is more complicated than an extended Unicode sequence,  which  is  what
3092         PCRE matches.         PCRE matches.
3093    
3094         8. PCRE does support the \Q...\E escape for quoting substrings. Charac-         8. PCRE does support the \Q...\E escape for quoting substrings. Charac-
3095         ters  in  between  are  treated as literals. This is slightly different         ters in between are treated as literals.  This  is  slightly  different
3096         from Perl in that $ and @ are  also  handled  as  literals  inside  the         from  Perl  in  that  $  and  @ are also handled as literals inside the
3097         quotes.  In Perl, they cause variable interpolation (but of course PCRE         quotes. In Perl, they cause variable interpolation (but of course  PCRE
3098         does not have variables). Note the following examples:         does not have variables). Note the following examples:
3099    
3100             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 3260  DIFFERENCES BETWEEN PCRE AND PERL Line 3104  DIFFERENCES BETWEEN PCRE AND PERL
3104             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
3105             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
3106    
3107         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
3108         classes.         classes.
3109    
3110         9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})         9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
3111         constructions. However, there is support for recursive  patterns.  This         constructions.  However,  there is support for recursive patterns. This
3112         is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE         is not available in Perl 5.8, but it is in Perl 5.10.  Also,  the  PCRE
3113         "callout" feature allows an external function to be called during  pat-         "callout"  feature allows an external function to be called during pat-
3114         tern matching. See the pcrecallout documentation for details.         tern matching. See the pcrecallout documentation for details.
3115    
3116         10.  Subpatterns  that  are  called recursively or as "subroutines" are         10. Subpatterns that are called recursively  or  as  "subroutines"  are
3117         always treated as atomic groups in  PCRE.  This  is  like  Python,  but         always  treated  as  atomic  groups  in  PCRE. This is like Python, but
3118         unlike  Perl. There is a discussion of an example that explains this in         unlike Perl. There is a discussion of an example that explains this  in
3119         more detail in the section on recursion differences from  Perl  in  the         more  detail  in  the section on recursion differences from Perl in the
3120         pcrepattern page.         pcrepattern page.
3121    
3122         11.  There are some differences that are concerned with the settings of         11. There are some differences that are concerned with the settings  of
3123         captured strings when part of  a  pattern  is  repeated.  For  example,         captured  strings  when  part  of  a  pattern is repeated. For example,
3124         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2         matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
3125         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
3126    
3127         12. PCRE's handling of duplicate subpattern numbers and duplicate  sub-         12.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
3128         pattern names is not as general as Perl's. This is a consequence of the         pattern names is not as general as Perl's. This is a consequence of the
3129         fact that PCRE works internally just with numbers, using an external         fact that PCRE works internally just with numbers, using an external
3130         table  to  translate  between numbers and names. In particular, a pattern         table to translate between numbers and names. In  particular,  a  pattern
3131         such as (?|(?<a>A)|(?<b>B)), where the two  capturing  parentheses  have         such  as  (?|(?<a>A)|(?<b>B)),  where the two capturing parentheses have
3132         the  same  number  but different names, is not supported, and causes an         the same number but different names, is not supported,  and  causes  an
3133         error at compile time. If it were allowed, it would not be possible  to         error  at compile time. If it were allowed, it would not be possible to
3134         distinguish  which  parentheses matched, because both names map to cap-         distinguish which parentheses matched, because both names map  to  cap-
3135         turing subpattern number 1. To avoid this confusing situation, an error         turing subpattern number 1. To avoid this confusing situation, an error
3136         is given at compile time.         is given at compile time.
3137    
3138         13.  Perl  recognizes  comments  in some places that PCRE does not, for         13. Perl recognizes comments in some places that  PCRE  does  not,  for
3139         example, between the ( and ? at the start of a subpattern.  If  the  /x         example,  between  the  ( and ? at the start of a subpattern. If the /x
3140         modifier  is set, Perl allows whitespace between ( and ? but PCRE never         modifier is set, Perl allows whitespace between ( and ? but PCRE  never
3141         does, even if the PCRE_EXTENDED option is set.         does, even if the PCRE_EXTENDED option is set.
3142    
3143         14. PCRE provides some extensions to the Perl regular expression facil-         14. PCRE provides some extensions to the Perl regular expression facil-
3144         ities.   Perl  5.10  includes new features that are not in earlier ver-         ities.  Perl 5.10 includes new features that are not  in  earlier  ver-
3145         sions of Perl, some of which (such as named parentheses) have  been  in         sions  of  Perl, some of which (such as named parentheses) have been in
3146         PCRE for some time. This list is with respect to Perl 5.10:         PCRE for some time. This list is with respect to Perl 5.10:
3147    
3148         (a)  Although  lookbehind  assertions  in  PCRE must match fixed length         (a) Although lookbehind assertions in  PCRE  must  match  fixed  length
3149         strings, each alternative branch of a lookbehind assertion can match  a         strings,  each alternative branch of a lookbehind assertion can match a
3150         different  length  of  string.  Perl requires them all to have the same         different length of string. Perl requires them all  to  have  the  same
3151         length.         length.
3152    
3153         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
3154         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
3155    
3156         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
3157         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
3158         ignored.  (Perl can be made to issue a warning.)         ignored.  (Perl can be made to issue a warning.)
3159    
3160         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
3161         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
3162         lowed by a question mark they are.         lowed by a question mark they are.
3163    
# Line 3321  DIFFERENCES BETWEEN PCRE AND PERL Line 3165  DIFFERENCES BETWEEN PCRE AND PERL
3165         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
3166    
3167         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3168         and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-         and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no  Perl  equiva-
3169         lents.         lents.
3170    
3171         (g) The \R escape sequence can be restricted to match only CR,  LF,  or         (g)  The  \R escape sequence can be restricted to match only CR, LF, or
3172         CRLF by the PCRE_BSR_ANYCRLF option.         CRLF by the PCRE_BSR_ANYCRLF option.
3173    
3174         (h) The callout facility is PCRE-specific.         (h) The callout facility is PCRE-specific.
# Line 3334  DIFFERENCES BETWEEN PCRE AND PERL Line 3178  DIFFERENCES BETWEEN PCRE AND PERL
3178         (j) Patterns compiled by PCRE can be saved and re-used at a later time,         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
3179         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
3180    
3181         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a         (k)  The  alternative  matching function (pcre_dfa_exec()) matches in a
3182         different way and is not Perl-compatible.         different way and is not Perl-compatible.
3183    
3184         (l)  PCRE  recognizes some special sequences such as (*CR) at the start         (l) PCRE recognizes some special sequences such as (*CR) at  the  start
3185         of a pattern that set overall options that cannot be changed within the         of a pattern that set overall options that cannot be changed within the
3186         pattern.         pattern.
3187    
# Line 3351  AUTHOR Line 3195  AUTHOR
3195    
3196  REVISION  REVISION
3197    
3198         Last updated: 24 July 2011         Last updated: 24 August 2011
3199         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3200  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3201    
3202    
3203  PCREPATTERN(3)                                                  PCREPATTERN(3)  PCREPATTERN(3)                                                  PCREPATTERN(3)
3204    
3205    
# Line 3391  PCRE REGULAR EXPRESSION DETAILS Line 3235  PCRE REGULAR EXPRESSION DETAILS
3235         Starting a pattern with this sequence  is  equivalent  to  setting  the         Starting a pattern with this sequence  is  equivalent  to  setting  the
3236         PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting         PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
3237         UTF-8 mode affects pattern matching  is  mentioned  in  several  places         UTF-8 mode affects pattern matching  is  mentioned  in  several  places
3238         below.  There  is  also  a  summary of UTF-8 features in the section on         below.  There  is  also  a summary of UTF-8 features in the pcreunicode
3239         UTF-8 support in the main pcre page.         page.
3240    
3241         Another special sequence that may appear at the start of a  pattern  or         Another special sequence that may appear at the start of a  pattern  or
3242         in combination with (*UTF8) is:         in combination with (*UTF8) is:
# Line 5860  AUTHOR Line 5704  AUTHOR
5704    
5705  REVISION  REVISION
5706    
5707         Last updated: 24 July 2011         Last updated: 24 August 2011
5708         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
5709  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5710    
5711    
5712  PCRESYNTAX(3)                                                    PCRESYNTAX(3)  PCRESYNTAX(3)                                                    PCRESYNTAX(3)
5713    
5714    
# Line 6233  REVISION Line 6077  REVISION
6077         Last updated: 21 November 2010         Last updated: 21 November 2010
6078         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6079  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6080    
6081    
6082    PCREUNICODE(3)                                                  PCREUNICODE(3)
6083    
6084    
6085    NAME
6086           PCRE - Perl-compatible regular expressions
6087    
6088    
6089    UTF-8 AND UNICODE PROPERTY SUPPORT
6090    
6091           In order to process UTF-8 strings, you must build PCRE to include UTF-8
6092           support in the code, and, in addition,  you  must  call  pcre_compile()
6093           with  the  PCRE_UTF8  option  flag,  or the pattern must start with the
6094           sequence (*UTF8). When either of these is the case,  both  the  pattern
6095           and  any  subject  strings  that  are matched against it are treated as
6096           UTF-8 strings instead of strings of 1-byte characters.  PCRE  does  not
6097           support any other formats (in particular, it does not support UTF-16).
6098    
6099           If  you compile PCRE with UTF-8 support, but do not use it at run time,
6100           the library will be a bit bigger, but the additional run time  overhead
6101           is limited to testing the PCRE_UTF8 flag occasionally, so should not be
6102           very big.
6103    
6104           If PCRE is built with Unicode character property support (which implies
6105           UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
6106           ported.  The available properties that can be tested are limited to the
6107           general  category  properties such as Lu for an upper case letter or Nd
6108           for a decimal number, the Unicode script names such as Arabic  or  Han,
6109           and  the  derived  properties  Any  and L&. A full list is given in the
6110           pcrepattern documentation. Only the short names for properties are sup-
6111           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
6112           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
6113           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
6114           does not support this.
6115    
6116       Validity of UTF-8 strings
6117    
6118           When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and
6119           subjects are (by default) checked for validity on entry to the relevant
6120           functions. From release 7.3 of PCRE, the check is according to the rules
6121           of  RFC  3629, which are themselves derived from the Unicode specifica-
6122           tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which
6123           allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current
6124           check allows only values in the range U+0 to U+10FFFF, excluding U+D800
6125           to U+DFFF.
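
As an illustration only, the RFC 3629 rules described above can be sketched
in C as follows. The name utf8_first_invalid() is hypothetical and is not
part of the PCRE API, and PCRE's own check may differ in detail. The sketch
returns the offset of the first byte of the first failing character, or -1
if the whole buffer is acceptable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function). Checks a buffer against the
   RFC 3629 rules: code points U+0 to U+10FFFF, excluding U+D800 to U+DFFF,
   each in its shortest encoding. Returns -1 if valid, otherwise the offset
   of the first byte of the first invalid character. */
static long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected after the lead */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest value allowed for this length */
        size_t k;

        if (c < 0x80) { i++; continue; }            /* 1-byte character */
        else if (c < 0xC0) return (long)i;          /* stray continuation */
        else if (c < 0xE0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if (c < 0xF0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if (c < 0xF8) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;    /* 5- and 6-byte forms exist only in RFC 2279 */

        if (len - i <= extra) return (long)i;       /* truncated character */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                      /* overlong form */
        if (cp > 0x10FFFF) return (long)i;                 /* above Unicode */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;  /* surrogate */

        i += extra + 1;
    }
    return -1;
}
```

A caller that has already validated its data in this way might reasonably
set PCRE_NO_UTF8_CHECK, on the assumption that the strings are not modified
afterwards.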
6126    
6127           The excluded code points are the surrogate areas  of  Unicode.  Of  the
6128           low  area,  the Unicode Standard says: "The Low Surrogate Area does not
6129           contain  any  character  assignments,  consequently  no  character code
6130           charts or namelists are provided for this area. Surrogates are reserved
6131           for  use  with  UTF-16 and then must be used in pairs." The code points
6132           that are encoded by UTF-16 pairs  are  available  as  independent  code
6133           points  in  the  UTF-8  encoding.  (In other words, the whole surrogate
6134           thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
6135    
6136           If an invalid UTF-8 string is passed to PCRE, an error return is given.
6137           At  compile  time, the only additional information is the offset to the
6138           first byte of the failing character. The runtime functions  pcre_exec()
6139           and  pcre_dfa_exec() also pass back this information, as well as a more
6140           detailed reason code if the caller has provided memory in which  to  do
6141           this.
6142    
6143           In  some  situations, you may already know that your strings are valid,
6144           and therefore want to skip these checks in  order  to  improve  perfor-
6145           mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run
6146           time, PCRE assumes that the pattern or subject  it  is  given  (respec-
6147           tively)  contains  only  valid  UTF-8  codes. In this case, it does not
6148           diagnose an invalid UTF-8 string.
6149    
6150           If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,
6151           what  happens  depends on why the string is invalid. If the string con-
6152           forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
6153           string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,
6154           apart from the initial validity test, PCRE (when in UTF-8 mode) handles
6155           strings  according  to  the more liberal rules of RFC 2279. However, if
6156           the string does not even conform to RFC 2279, the result is  undefined.
6157           Your program may crash.
6158    
6159           If  you  want  to  process  strings  of  values  in the full range 0 to
6160           0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can
6161           set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
6162           this situation, you will have to apply your own validity check.
6163    
6164       General comments about UTF-8 mode
6165    
6166           1. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
6167           two-byte UTF-8 character if the value is greater than 127.
6168    
6169           2.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
6170           characters for values greater than \177.
6171    
6172           3. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
6173           vidual bytes, for example: \x{100}{3}.
6174    
6175           4.  The dot metacharacter matches one UTF-8 character instead of a sin-
6176           gle byte.
6177    
6178           5. The escape sequence \C can be used to match a single byte  in  UTF-8
6179           mode,  but  its  use can lead to some strange effects. This facility is
6180           not available in the alternative matching function, pcre_dfa_exec().
6181    
6182           6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
6183           test characters of any code value, but, by default, the characters that
6184           PCRE recognizes as digits, spaces, or word characters remain  the  same
6185           set  as  before,  all with values less than 256. This remains true even
6186           when PCRE is built to include Unicode property support, because  to  do
6187           otherwise would slow down PCRE in many common cases. Note in particular
6188           that this applies to \b and \B, because they are defined in terms of \w
6189           and  \W. If you really want to test for a wider sense of, say, "digit",
6190           you can use explicit Unicode property tests such  as  \p{Nd}.  Alterna-
6191           tively,  if  you  set  the  PCRE_UCP option, the way that the character
6192           escapes work is changed so that Unicode properties are used  to  deter-
6193           mine  which  characters match. There are more details in the section on
6194           generic character types in the pcrepattern documentation.
6195    
6196           7. Similarly, characters that match the POSIX named  character  classes
6197           are all low-valued characters, unless the PCRE_UCP option is set.
6198    
6199           8.  However,  the  horizontal  and vertical whitespace matching escapes
6200           (\h, \H, \v, and \V) do match all the appropriate  Unicode  characters,
6201           whether or not PCRE_UCP is set.
6202    
6203           9.  Case-insensitive  matching  applies only to characters whose values
6204           are less than 128, unless PCRE is built with Unicode property  support.
6205           Even  when  Unicode  property support is available, PCRE still uses its
6206           own character tables when checking the case of  low-valued  characters,
6207           so  as not to degrade performance.  The Unicode property information is
6208           used only for characters with higher values. Furthermore, PCRE supports
6209           case-insensitive  matching  only  when  there  is  a one-to-one mapping
6210           between a letter's cases. There are a small number of many-to-one  map-
6211           pings in Unicode; these are not supported by PCRE.
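
Items 1 and 2 above follow from how UTF-8 encodes values: anything above 127
needs at least two bytes. The sketch below shows that encoding for the
RFC 3629 range. The name utf8_encode() is hypothetical and is not part of
the PCRE API; it is given only to make the byte layout concrete.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function). Writes the UTF-8 encoding of
   cp into out, which must have room for 4 bytes, and returns the number of
   bytes written. Returns 0 for values that RFC 3629 does not permit. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond U+10FFFF */
}
```

With this sketch, the value 0xb3 from item 1 encodes as the two bytes 0xC2
0xB3, and the largest octal value from item 2, \777 (0x1FF), encodes as 0xC7
0xBF.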
6212    
6213    
6214    AUTHOR
6215    
6216           Philip Hazel
6217           University Computing Service
6218           Cambridge CB2 3QH, England.
6219    
6220    
6221    REVISION
6222    
6223           Last updated: 24 August 2011
6224           Copyright (c) 1997-2011 University of Cambridge.
6225    ------------------------------------------------------------------------------
6226    
6227    
6228    ------------------------------------------------------------------------------
6229    
6230    
6231  PCREPARTIAL(3)                                                  PCREPARTIAL(3)  PCREPARTIAL(3)                                                  PCREPARTIAL(3)
6232    
6233    
# Line 6653  REVISION Line 6646  REVISION
6646         Last updated: 07 November 2010         Last updated: 07 November 2010
6647         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6648  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6649    
6650    
6651  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
6652    
6653    
# Line 6778  REVISION Line 6771  REVISION
6771         Last updated: 17 November 2010         Last updated: 17 November 2010
6772         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6773  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6774    
6775    
6776  PCREPERFORM(3)                                                  PCREPERFORM(3)  PCREPERFORM(3)                                                  PCREPERFORM(3)
6777    
6778    
# Line 6946  REVISION Line 6939  REVISION
6939         Last updated: 16 May 2010         Last updated: 16 May 2010
6940         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6941  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6942    
6943    
6944  PCREPOSIX(3)                                                      PCREPOSIX(3)  PCREPOSIX(3)                                                      PCREPOSIX(3)
6945    
6946    
# Line 7209  REVISION Line 7202  REVISION
7202         Last updated: 16 May 2010         Last updated: 16 May 2010
7203         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
7204  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7205    
7206    
7207  PCRECPP(3)                                                          PCRECPP(3)  PCRECPP(3)                                                          PCRECPP(3)
7208    
7209    
# Line 7551  REVISION Line 7544  REVISION
7544         Last updated: 17 March 2009         Last updated: 17 March 2009
7545         Minor typo fixed: 25 July 2011         Minor typo fixed: 25 July 2011
7546  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7547    
7548    
7549  PCRESAMPLE(3)                                                    PCRESAMPLE(3)  PCRESAMPLE(3)                                                    PCRESAMPLE(3)
7550    
7551    
# Line 7638  REVISION Line 7631  REVISION
7631         Last updated: 17 November 2010         Last updated: 17 November 2010
7632         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
7633  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7634    PCRELIMITS(3)                                                    PCRELIMITS(3)
7635    
7636    
7637    NAME
7638           PCRE - Perl-compatible regular expressions
7639    
7640    
7641    SIZE AND OTHER LIMITATIONS
7642    
7643           There  are some size limitations in PCRE but it is hoped that they will
7644           never in practice be relevant.
7645    
7646           The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
7647           is compiled with the default internal linkage size of 2. If you want to
7648           process regular expressions that are truly enormous,  you  can  compile
7649           PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
7650           the source distribution and the pcrebuild documentation  for  details).
7651           In  these  cases the limit is substantially larger.  However, the speed
7652           of execution is slower.
7653    
7654           All values in repeating quantifiers must be less than 65536.
7655    
7656           There is no limit to the number of parenthesized subpatterns, but there
7657           can be no more than 65535 capturing subpatterns.
7658    
7659           The maximum length of a name for a named subpattern is 32 characters, and
7660           the maximum number of named subpatterns is 10000.
7661    
7662           The maximum length of a subject string is the largest  positive  number
7663           that  an integer variable can hold. However, when using the traditional
7664           matching function, PCRE uses recursion to handle subpatterns and indef-
7665           inite  repetition.  This means that the available stack space may limit
7666           the size of a subject string that can be processed by certain patterns.
7667           For a discussion of stack issues, see the pcrestack documentation.
7668    
7669    
7670    AUTHOR
7671    
7672           Philip Hazel
7673           University Computing Service
7674           Cambridge CB2 3QH, England.
7675    
7676    
7677    REVISION
7678    
7679           Last updated: 24 August 2011
7680           Copyright (c) 1997-2011 University of Cambridge.
7681    ------------------------------------------------------------------------------
7682    
7683    
7684  PCRESTACK(3)                                                      PCRESTACK(3)  PCRESTACK(3)                                                      PCRESTACK(3)
7685    
7686    
# Line 7789  REVISION Line 7832  REVISION
7832         Last updated: 22 July 2011         Last updated: 22 July 2011
7833         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
7834  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7835    
7836    
