Diff of /code/trunk/doc/pcre.txt

revision 185 by ph10, Tue Jun 19 13:39:46 2007 UTC revision 197 by ph10, Tue Jul 31 10:50:18 2007 UTC
# Line 114  LIMITATIONS Line 114  LIMITATIONS
114         There is no limit to the number of parenthesized subpatterns, but there         There is no limit to the number of parenthesized subpatterns, but there
115         can be no more than 65535 capturing subpatterns.         can be no more than 65535 capturing subpatterns.
116    
117           If  a  non-capturing subpattern with an unlimited repetition quantifier
118           can match an empty string, there is a limit of 1000 on  the  number  of
119           times  it  can  be  repeated while not matching an empty string - if it
120           does match an empty string, the loop is immediately broken.
121    
122         The maximum length of name for a named subpattern is 32 characters, and         The maximum length of name for a named subpattern is 32 characters, and
123         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
124    
125         The maximum length of a subject string is the largest  positive  number         The  maximum  length of a subject string is the largest positive number
126         that  an integer variable can hold. However, when using the traditional         that an integer variable can hold. However, when using the  traditional
127         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
128         inite  repetition.  This means that the available stack space may limit         inite repetition.  This means that the available stack space may  limit
129         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
130         For a discussion of stack issues, see the pcrestack documentation.         For a discussion of stack issues, see the pcrestack documentation.
131    
132    
133  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
134    
135         From  release  3.3,  PCRE  has  had  some support for character strings         From release 3.3, PCRE has  had  some  support  for  character  strings
136         encoded in the UTF-8 format. For release 4.0 this was greatly  extended         encoded  in the UTF-8 format. For release 4.0 this was greatly extended
137         to  cover  most common requirements, and in release 5.0 additional sup-         to cover most common requirements, and in release 5.0  additional  sup-
138         port for Unicode general category properties was added.         port for Unicode general category properties was added.
139    
140         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
141         support  in  the  code,  and, in addition, you must call pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
142         with the PCRE_UTF8 option flag. When you do this, both the pattern  and         with  the PCRE_UTF8 option flag. When you do this, both the pattern and
143         any  subject  strings  that are matched against it are treated as UTF-8         any subject strings that are matched against it are  treated  as  UTF-8
144         strings instead of just strings of bytes.         strings instead of just strings of bytes.
145    
146         If you compile PCRE with UTF-8 support, but do not use it at run  time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
147         the  library will be a bit bigger, but the additional run time overhead         the library will be a bit bigger, but the additional run time  overhead
148         is limited to testing the PCRE_UTF8 flag occasionally, so should not be         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
149         very big.         very big.
150    
151         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
152         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
153         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
154         general category properties such as Lu for an upper case letter  or  Nd         general  category  properties such as Lu for an upper case letter or Nd
155         for  a  decimal number, the Unicode script names such as Arabic or Han,         for a decimal number, the Unicode script names such as Arabic  or  Han,
156         and the derived properties Any and L&. A full  list  is  given  in  the         and  the  derived  properties  Any  and L&. A full list is given in the
157         pcrepattern documentation. Only the short names for properties are sup-         pcrepattern documentation. Only the short names for properties are sup-
158         ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-         ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
159         ter},  is  not  supported.   Furthermore,  in Perl, many properties may         ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
160         optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE         optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
161         does not support this.         does not support this.
162    
163         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
164    
165         1.  When you set the PCRE_UTF8 flag, the strings passed as patterns and         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
166         subjects are checked for validity on entry to the  relevant  functions.         subjects  are  checked for validity on entry to the relevant functions.
167         If an invalid UTF-8 string is passed, an error return is given. In some         If an invalid UTF-8 string is passed, an error return is given. In some
168         situations, you may already know  that  your  strings  are  valid,  and         situations,  you  may  already  know  that  your strings are valid, and
169         therefore want to skip these checks in order to improve performance. If         therefore want to skip these checks in order to improve performance. If
170         you set the PCRE_NO_UTF8_CHECK flag at compile time  or  at  run  time,         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
171         PCRE  assumes  that  the  pattern or subject it is given (respectively)         PCRE assumes that the pattern or subject  it  is  given  (respectively)
172         contains only valid UTF-8 codes. In this case, it does not diagnose  an         contains  only valid UTF-8 codes. In this case, it does not diagnose an
173         invalid  UTF-8 string. If you pass an invalid UTF-8 string to PCRE when         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
174         PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program  may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
175         crash.         crash.
176    
177         2.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
178         two-byte UTF-8 character if the value is greater than 127.         two-byte UTF-8 character if the value is greater than 127.
179    
180         3. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
181         characters for values greater than \177.         characters for values greater than \177.
182    
183         4.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
184         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
185    
186         5. The dot metacharacter matches one UTF-8 character instead of a  sin-         5.  The dot metacharacter matches one UTF-8 character instead of a sin-
187         gle byte.         gle byte.
188    
189         6.  The  escape sequence \C can be used to match a single byte in UTF-8         6. The escape sequence \C can be used to match a single byte  in  UTF-8
190         mode, but its use can lead to some strange effects.  This  facility  is         mode,  but  its  use can lead to some strange effects. This facility is
191         not available in the alternative matching function, pcre_dfa_exec().         not available in the alternative matching function, pcre_dfa_exec().
192    
193         7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
194         test characters of any code value, but the characters that PCRE  recog-         test  characters of any code value, but the characters that PCRE recog-
195         nizes  as  digits,  spaces,  or  word characters remain the same set as         nizes as digits, spaces, or word characters  remain  the  same  set  as
196         before, all with values less than 256. This remains true even when PCRE         before, all with values less than 256. This remains true even when PCRE
197         includes  Unicode  property support, because to do otherwise would slow         includes Unicode property support, because to do otherwise  would  slow
198         down PCRE in many common cases. If you really want to test for a  wider         down  PCRE in many common cases. If you really want to test for a wider
199         sense  of,  say,  "digit",  you must use Unicode property tests such as         sense of, say, "digit", you must use Unicode  property  tests  such  as
200         \p{Nd}.         \p{Nd}.
201    
202         8. Similarly, characters that match the POSIX named  character  classes         8.  Similarly,  characters that match the POSIX named character classes
203         are all low-valued characters.         are all low-valued characters.
204    
205         9.  However,  the Perl 5.10 horizontal and vertical whitespace matching         9. However, the Perl 5.10 horizontal and vertical  whitespace  matching
206         escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-         escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
207         acters.         acters.
208    
209         10.  Case-insensitive  matching applies only to characters whose values         10. Case-insensitive matching applies only to characters  whose  values
210         are less than 128, unless PCRE is built with Unicode property  support.         are  less than 128, unless PCRE is built with Unicode property support.
211         Even  when  Unicode  property support is available, PCRE still uses its         Even when Unicode property support is available, PCRE  still  uses  its
212         own character tables when checking the case of  low-valued  characters,         own  character  tables when checking the case of low-valued characters,
213         so  as not to degrade performance.  The Unicode property information is         so as not to degrade performance.  The Unicode property information  is
214         used only for characters with higher values. Even when Unicode property         used only for characters with higher values. Even when Unicode property
215         support is available, PCRE supports case-insensitive matching only when         support is available, PCRE supports case-insensitive matching only when
216         there is a one-to-one mapping between a letter's  cases.  There  are  a         there  is  a  one-to-one  mapping between a letter's cases. There are a
217         small  number  of  many-to-one  mappings in Unicode; these are not sup-         small number of many-to-one mappings in Unicode;  these  are  not  sup-
218         ported by PCRE.         ported by PCRE.
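
Items 1 to 3 in the list above can be illustrated with a small, self-contained C sketch. It is not PCRE's code: the function names utf8_encode() and utf8_is_valid() are hypothetical, the encoder assumes the modern code point range (up to 0x10FFFF), and the validity check is structural only (it does not reject every overlong form or surrogate). It shows why a value above 127, such as \xb3 or \777, occupies two bytes in UTF-8, and what kind of test a validity check on entry has to make.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode one code point as UTF-8.
   'out' must have room for 4 bytes. Returns the number of bytes written,
   or 0 if the value is outside the range assumed here (0..0x10FFFF). */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 128..2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper, not part of PCRE: structural UTF-8 check.
   Verifies lead bytes and the expected number of continuation bytes.
   Returns 1 if the buffer passes, 0 otherwise. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char c = s[i];
        size_t extra;
        if (c < 0x80) extra = 0;
        else if (c >= 0xC2 && c <= 0xDF) extra = 1;
        else if (c >= 0xE0 && c <= 0xEF) extra = 2;
        else if (c >= 0xF0 && c <= 0xF4) extra = 3;
        else return 0;                    /* stray continuation or bad lead byte */
        if (extra > len - i - 1) return 0;   /* sequence is truncated */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80) return 0;
        i += extra + 1;
    }
    return 1;
}
```

With these helpers, utf8_encode(0xB3, buf) writes the two bytes C2 B3, which is the character that \xb3 matches in UTF-8 mode, whereas the lone byte B3 is rejected by utf8_is_valid(). A caller that has already run a check of this kind is the case for which PCRE_NO_UTF8_CHECK is intended.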
219    
220    
# Line 219  AUTHOR Line 224  AUTHOR
224         University Computing Service         University Computing Service
225         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
226    
227         Putting an actual email address here seems to have been a spam  magnet,         Putting  an actual email address here seems to have been a spam magnet,
228         so  I've  taken  it away. If you want to email me, use my two initials,         so I've taken it away. If you want to email me, use  my  two  initials,
229         followed by the two digits 10, at the domain cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
230    
231    
232  REVISION  REVISION
233    
234         Last updated: 13 June 2007         Last updated: 30 July 2007
235         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
236  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
237    
# Line 459  USING EBCDIC CODE Line 464  USING EBCDIC CODE
464    
465         PCRE  assumes  by  default that it will run in an environment where the         PCRE  assumes  by  default that it will run in an environment where the
466         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
467         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by         This  is  the  case for most computer operating systems. PCRE can, how-
468         adding         ever, be compiled to run in an EBCDIC environment by adding
469    
470           --enable-ebcdic           --enable-ebcdic
471    
472         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
473         bles.         bles.  You  should  only  use  it if you know that you are in an EBCDIC
474           environment (for example, an IBM mainframe operating system).
475    
476    
477  SEE ALSO  SEE ALSO
# Line 482  AUTHOR Line 488  AUTHOR
488    
489  REVISION  REVISION
490    
491         Last updated: 05 June 2007         Last updated: 30 July 2007
492         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
493  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
494    
# Line 1559  INFORMATION ABOUT A PATTERN Line 1565  INFORMATION ABOUT A PATTERN
1565         Return a copy of the options with which the pattern was  compiled.  The         Return a copy of the options with which the pattern was  compiled.  The
1566         fourth  argument  should  point to an unsigned long int variable. These         fourth  argument  should  point to an unsigned long int variable. These
1567         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1568         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1569           other words, they are the options that will be in force  when  matching
1570           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1571           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1572           and PCRE_EXTENDED.
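
The rule in the revised paragraph (only option settings at the start of the pattern affect the reported options) can be sketched in plain C. This is an illustration under stated assumptions, not PCRE's implementation: the function options_at_start() and the OPT_ constants are hypothetical names with illustrative bit values, and only the letters i, m, s, and x are handled. With the real library, the value would be obtained by calling pcre_fullinfo() with PCRE_INFO_OPTIONS.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative option bits; hypothetical names, not taken from pcre.h. */
enum {
    OPT_CASELESS  = 1,
    OPT_MULTILINE = 2,
    OPT_DOTALL    = 4,
    OPT_EXTENDED  = 8
};

/* Hypothetical helper: start from the options passed at compile time and
   apply any pure option groups such as (?im) or (?i-x) that appear at the
   very start of the pattern. Settings later in the pattern are ignored,
   because they are not in force when matching starts. */
unsigned long options_at_start(const char *pattern, unsigned long options)
{
    assert(pattern != NULL);
    const char *p = pattern;

    while (p[0] == '(' && p[1] == '?') {
        const char *q = p + 2;
        unsigned long set = 0, unset = 0;
        int negate = 0;

        for (; *q != ')'; q++) {
            unsigned long bit;
            switch (*q) {
            case 'i': bit = OPT_CASELESS;  break;
            case 'm': bit = OPT_MULTILINE; break;
            case 's': bit = OPT_DOTALL;    break;
            case 'x': bit = OPT_EXTENDED;  break;
            case '-': negate = 1; continue;   /* following letters are unset */
            default:  return options;         /* not a pure option group */
            }
            if (negate) unset |= bit; else set |= bit;
        }
        options = (options | set) & ~unset;
        p = q + 1;                            /* look for another leading group */
    }
    return options;
}
```

For the pattern in the text, options_at_start("(?im)abc(?-i)d", OPT_EXTENDED) yields the caseless, multiline, and extended bits; the later (?-i) has no effect on the result because it does not appear at the start.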
1573    
1574         A  pattern  is  automatically  anchored by PCRE if all of its top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1575         alternatives begin with one of the following:         alternatives begin with one of the following:
# Line 2050  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2060  MATCHING A PATTERN: THE TRADITIONAL FUNC
2060         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2061         description above.         description above.
2062    
          PCRE_ERROR_NULLWSLIMIT    (-22)  
   
        When a group that can match an empty  substring  is  repeated  with  an  
        unbounded  upper  limit, the subject position at the start of the group  
        must be remembered, so that a test for an empty string can be made when  
        the  end  of the group is reached. Some workspace is required for this;  
        if it runs out, this error is given.  
   
2063           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
2064    
2065         An invalid combination of PCRE_NEWLINE_xxx options was given.         An invalid combination of PCRE_NEWLINE_xxx options was given.
2066    
2067         Error numbers -16 to -20 are not used by pcre_exec().         Error numbers -16 to -20 and -22 are not used by pcre_exec().
2068    
2069    
2070  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 2417  AUTHOR Line 2419  AUTHOR
2419    
2420  REVISION  REVISION
2421    
2422         Last updated: 13 June 2007         Last updated: 30 July 2007
2423         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
2424  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2425    
