
Diff of /code/trunk/doc/pcreunicode.3

revision 1260 by ph10, Sun Nov 11 20:27:03 2012 UTC revision 1261 by ph10, Wed Feb 27 16:27:01 2013 UTC
# Line 1  Line 1 
1  .TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"  .TH PCREUNICODE 3 "27 February 2013" "PCRE 8.33"
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"  .SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
# Line 84  place. From release 7.3 of PCRE, the che Line 84  place. From release 7.3 of PCRE, the che
84  which are themselves derived from the Unicode specification. Earlier releases  which are themselves derived from the Unicode specification. Earlier releases
85  of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit  of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit
86  values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0  values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0
87  to U+10FFFF, excluding the surrogate area and the non-characters.  to U+10FFFF, excluding the surrogate area. (From release 8.33 the so-called
88    "non-character" code points are no longer excluded because Unicode corrigendum
89    #9 makes it clear that they should not be.)
90  .P  .P
91  Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,  Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16,
92  where they are used in pairs to encode codepoints with values greater than  where they are used in pairs to encode codepoints with values greater than
# Line 93  independently in the UTF-8 and UTF-32 en Line 95  independently in the UTF-8 and UTF-32 en
95  surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and  surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and
96  UTF-32.)  UTF-32.)
97  .P  .P
 Also excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF  
 and the last two code points in each plane, U+??FFFE and U+??FFFF.  
 .P  
98  If an invalid UTF-8 string is passed to PCRE, an error return is given. At  If an invalid UTF-8 string is passed to PCRE, an error return is given. At
99  compile time, the only additional information is the offset to the first byte  compile time, the only additional information is the offset to the first byte
100  of the failing character. The run-time functions \fBpcre_exec()\fP and  of the failing character. The run-time functions \fBpcre_exec()\fP and
# Line 128  to the relevant functions. Values other Line 127  to the relevant functions. Values other
127  U+D800 to U+DFFF are independent code points. Values in the surrogate range  U+D800 to U+DFFF are independent code points. Values in the surrogate range
128  must be used in pairs in the correct manner.  must be used in pairs in the correct manner.
129  .P  .P
 Excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF  
 and the last two code points in each plane, U+??FFFE and U+??FFFF.  
 .P  
130  If an invalid UTF-16 string is passed to PCRE, an error return is given. At  If an invalid UTF-16 string is passed to PCRE, an error return is given. At
131  compile time, the only additional information is the offset to the first data  compile time, the only additional information is the offset to the first data
132  unit of the failing character. The run-time functions \fBpcre16_exec()\fP and  unit of the failing character. The run-time functions \fBpcre16_exec()\fP and
# Line 152  However, if an invalid string is passed, Line 148  However, if an invalid string is passed,
148  When you set the PCRE_UTF32 flag, the strings of 32-bit data units that are  When you set the PCRE_UTF32 flag, the strings of 32-bit data units that are
149  passed as patterns and subjects are (by default) checked for validity on entry  passed as patterns and subjects are (by default) checked for validity on entry
150  to the relevant functions.  This check allows only values in the range U+0  to the relevant functions.  This check allows only values in the range U+0
151  to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the  to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF.
 "Non-Character" code points, which are U+FDD0 to U+FDEF and the last two  
 characters in each plane, U+??FFFE and U+??FFFF.  
152  .P  .P
153  If an invalid UTF-32 string is passed to PCRE, an error return is given. At  If an invalid UTF-32 string is passed to PCRE, an error return is given. At
154  compile time, the only additional information is the offset to the first data  compile time, the only additional information is the offset to the first data
# Line 250  Cambridge CB2 3QH, England. Line 244  Cambridge CB2 3QH, England.
244  .rs  .rs
245  .sp  .sp
246  .nf  .nf
247  Last updated: 11 November 2012  Last updated: 27 February 2013
248  Copyright (c) 1997-2012 University of Cambridge.  Copyright (c) 1997-2013 University of Cambridge.
249  .fi  .fi
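
The change documented in this revision can be illustrated with a small C sketch of the per-code-point validity rule. It is only an illustration based on the wording of the man page, not PCRE's actual implementation: the function names below are hypothetical and are not part of the PCRE API. The "old" rule follows the r1260 text (surrogates and non-characters both excluded); the "new" rule follows the r1261 text (only surrogates excluded).

```c
#include <stdbool.h>
#include <stdint.h>

/* All names here are hypothetical helpers for illustration only;
 * none of them are PCRE functions. */

/* True for the surrogate area U+D800 to U+DFFF. */
static bool is_surrogate(uint32_t c)
{
    return c >= 0xD800u && c <= 0xDFFFu;
}

/* True for the "non-character" code points: U+FDD0 to U+FDEF and the
 * last two code points in each plane (U+??FFFE and U+??FFFF). */
static bool is_noncharacter(uint32_t c)
{
    if (c > 0x10FFFFu)
        return false;
    return (c >= 0xFDD0u && c <= 0xFDEFu) || (c & 0xFFFEu) == 0xFFFEu;
}

/* Rule as described in the r1261 text (PCRE 8.33):
 * U+0 to U+10FFFF, excluding only the surrogate area. */
static bool valid_code_point_new(uint32_t c)
{
    return c <= 0x10FFFFu && !is_surrogate(c);
}

/* Rule as described in the r1260 text (PCRE 8.32):
 * additionally excludes the non-characters. */
static bool valid_code_point_old(uint32_t c)
{
    return valid_code_point_new(c) && !is_noncharacter(c);
}
```

Under this sketch, U+FFFE and U+FDD0 are rejected by the old rule and accepted by the new one, while U+D800 and any value above U+10FFFF are rejected by both.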

Legend:
Removed from v.1260  
changed lines
  Added in v.1261