Diff of /code/trunk/doc/html/pcrepattern.html

revision 1403 by ph10, Tue May 28 09:13:59 2013 UTC revision 1404 by ph10, Tue Nov 19 15:36:57 2013 UTC
# Line 116  appearance causes an error. Line 116  appearance causes an error.
116  Unicode property support  Unicode property support
117  </b><br>  </b><br>
118  <P>  <P>
119  Another special sequence that may appear at the start of a pattern is  Another special sequence that may appear at the start of a pattern is (*UCP).
 <pre>  
   (*UCP)  
 </pre>  
120  This has the same effect as setting the PCRE_UCP option: it causes sequences  This has the same effect as setting the PCRE_UCP option: it causes sequences
121  such as \d and \w to use Unicode properties to determine character types,  such as \d and \w to use Unicode properties to determine character types,
122  instead of recognizing only characters with codes less than 128 via a lookup  instead of recognizing only characters with codes less than 128 via a lookup
123  table.  table.
124  </P>  </P>
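As a rough illustration of the difference that (*UCP) makes, the sketch below uses Python's <b>re</b> module rather than PCRE; that substitution is an assumption made only so that the example is self-contained. Python's re.ASCII flag stands in for the default lookup-table behaviour, and Python's default Unicode matching stands in for PCRE_UCP. The two engines are not guaranteed to agree on every character.

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE has the Unicode property Nd.
arabic_three = "\u0663"
# U+00E9 LATIN SMALL LETTER E WITH ACUTE has the Unicode property Ll.
e_acute = "\u00e9"

# ASCII-only matching: analogous to PCRE without PCRE_UCP.
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None
ascii_word = re.fullmatch(r"\w", e_acute, re.ASCII) is not None

# Unicode-aware matching: analogous to PCRE with PCRE_UCP or (*UCP).
ucp_digit = re.fullmatch(r"\d", arabic_three) is not None
ucp_word = re.fullmatch(r"\w", e_acute) is not None

print(ascii_digit, ascii_word, ucp_digit, ucp_word)
```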
125  <br><b>  <br><b>
126    Disabling auto-possessification
127    </b><br>
128    <P>
129    If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
130    the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making
131    quantifiers possessive when what follows cannot match the repeated item. For
132    example, by default a+b is treated as a++b. For more details, see the
133    <a href="pcreapi.html"><b>pcreapi</b></a>
134    documentation.
135    </P>
136    <br><b>
137  Disabling start-up optimizations  Disabling start-up optimizations
138  </b><br>  </b><br>
139  <P>  <P>
140  If a pattern starts with (*NO_START_OPT), it has the same effect as setting the  If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
141  PCRE_NO_START_OPTIMIZE option either at compile or matching time.  PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables
142    several optimizations for quickly reaching "no match" results. For more
143    details, see the
144    <a href="pcreapi.html"><b>pcreapi</b></a>
145    documentation.
146  <a name="newlines"></a></P>  <a name="newlines"></a></P>
147  <br><b>  <br><b>
148  Newline conventions  Newline conventions
# Line 193  pattern of the form Line 205  pattern of the form
205    (*LIMIT_RECURSION=d)    (*LIMIT_RECURSION=d)
206  </pre>  </pre>
207  where d is any number of decimal digits. However, the value of the setting must  where d is any number of decimal digits. However, the value of the setting must
208  be less than the value set by the caller of <b>pcre_exec()</b> for it to have  be less than the value set (or defaulted) by the caller of <b>pcre_exec()</b>
209  any effect. In other words, the pattern writer can lower the limit set by the  for it to have any effect. In other words, the pattern writer can lower the
210  programmer, but not raise it. If there is more than one setting of one of these  limits set by the programmer, but not raise them. If there is more than one
211  limits, the lower value is used.  setting of one of these limits, the lower value is used.
212  </P>  </P>
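The rule that a pattern can lower a limit but never raise it can be sketched as follows. The function name effective_limit and its parameters are hypothetical and are not part of the PCRE API; the sketch only models the arithmetic described above.

```python
def effective_limit(caller_limit, pattern_settings):
    """Return the limit in force after applying in-pattern settings.

    caller_limit is the value set (or defaulted) by the caller;
    pattern_settings holds the values of any (*LIMIT_...=d) items
    that appear at the start of the pattern.
    """
    limit = caller_limit
    for value in pattern_settings:
        # A setting has an effect only if it lowers the current limit.
        if value < limit:
            limit = value
    return limit
```

With a caller limit of 1000, a pattern setting of 5000 is ignored, whereas settings of 500 and 200 together leave a limit of 200.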
213  <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>  <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
214  <P>  <P>
# Line 283  backslash. All other characters (in part Line 295  backslash. All other characters (in part
295  greater than 127) are treated as literals.  greater than 127) are treated as literals.
296  </P>  </P>
297  <P>  <P>
298  If a pattern is compiled with the PCRE_EXTENDED option, white space in the  If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
299  pattern (other than in a character class) and characters between a # outside  pattern (other than in a character class), and characters between a # outside a
300  a character class and the next newline are ignored. An escaping backslash can  character class and the next newline, inclusive, are ignored. An escaping
301  be used to include a white space or # character as part of the pattern.  backslash can be used to include a white space or # character as part of the
302    pattern.
303  </P>  </P>
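Python's re.VERBOSE flag behaves in a broadly similar way to PCRE_EXTENDED, so it can serve as a stand-in for a short demonstration; using Python here is an assumption for the sake of a runnable example, and the two engines may differ in exactly which white space characters are ignored.

```python
import re

pattern = re.compile(r"""
    \d+      # one or more digits
    \ \#     # an escaped space, then an escaped hash
    [ #]x    # inside a class, space and # are literal
""", re.VERBOSE)

matches_spaced = pattern.fullmatch("42 # x") is not None
matches_hashed = pattern.fullmatch("42 ##x") is not None
matches_bare = pattern.fullmatch("42#x") is not None
```

The first two subjects match; the third does not, because the escaped space in the pattern must match a real space in the subject.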
304  <P>  <P>
305  If you want to remove the special meaning from a sequence of characters, you  If you want to remove the special meaning from a sequence of characters, you
# Line 324  one of the following escape sequences th Line 337  one of the following escape sequences th
337    \n        linefeed (hex 0A)    \n        linefeed (hex 0A)
338    \r        carriage return (hex 0D)    \r        carriage return (hex 0D)
339    \t        tab (hex 09)    \t        tab (hex 09)
340      \0dd      character with octal code 0dd
341    \ddd      character with octal code ddd, or back reference    \ddd      character with octal code ddd, or back reference
342      \o{ddd..} character with octal code ddd..
343    \xhh      character with hex code hh    \xhh      character with hex code hh
344    \x{hhh..} character with hex code hhh.. (non-JavaScript mode)    \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
345    \uhhhh    character with hex code hhhh (JavaScript mode only)    \uhhhh    character with hex code hhhh (JavaScript mode only)
# Line 347  the EBCDIC letters are disjoint, \cZ bec Line 362  the EBCDIC letters are disjoint, \cZ bec
362  characters also generate different values.  characters also generate different values.
363  </P>  </P>
364  <P>  <P>
 By default, after \x, from zero to two hexadecimal digits are read (letters  
 can be in upper or lower case). Any number of hexadecimal digits may appear  
 between \x{ and }, but the character code is constrained as follows:  
 <pre>  
   8-bit non-UTF mode    less than 0x100  
   8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint  
   16-bit non-UTF mode   less than 0x10000  
   16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint  
   32-bit non-UTF mode   less than 0x80000000  
   32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint  
 </pre>  
 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  
 "surrogate" codepoints), and 0xffef.  
 </P>  
 <P>  
 If characters other than hexadecimal digits appear between \x{ and }, or if  
 there is no terminating }, this form of escape is not recognized. Instead, the  
 initial \x will be interpreted as a basic hexadecimal escape, with no  
 following digits, giving a character whose value is zero.  
 </P>  
 <P>  
 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is  
 as just described only when it is followed by two hexadecimal digits.  
 Otherwise, it matches a literal "x" character. In JavaScript mode, support for  
 code points greater than 256 is provided by \u, which must be followed by  
 four hexadecimal digits; otherwise it matches a literal "u" character.  
 Character codes specified by \u in JavaScript mode are constrained in the same  
 was as those specified by \x in non-JavaScript mode.  
 </P>  
 <P>  
 Characters whose value is less than 256 can be defined by either of the two  
 syntaxes for \x (or by \u in JavaScript mode). There is no difference in the  
 way they are handled. For example, \xdc is exactly the same as \x{dc} (or  
 \u00dc in JavaScript mode).  
 </P>  
 <P>  
365  After \0 up to two further octal digits are read. If there are fewer than two  After \0 up to two further octal digits are read. If there are fewer than two
366  digits, just those that are present are used. Thus the sequence \0\x\07  digits, just those that are present are used. Thus the sequence \0\x\07
367  specifies two binary zeros followed by a BEL character (code value 7). Make  specifies two binary zeros followed by a BEL character (code value 7). Make
# Line 390  sure you supply two digits after the ini Line 369  sure you supply two digits after the ini
369  follows is itself an octal digit.  follows is itself an octal digit.
370  </P>  </P>
371  <P>  <P>
372  The handling of a backslash followed by a digit other than 0 is complicated.  The escape \o must be followed by a sequence of octal digits, enclosed in
373  Outside a character class, PCRE reads it and any following digits as a decimal  braces. An error occurs if this is not the case. This escape is a recent
374  number. If the number is less than 10, or if there have been at least that many  addition to Perl; it provides a way of specifying character code points as octal
375    numbers greater than 0777, and it also allows octal numbers and back references
376    to be unambiguously specified.
377    </P>
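The syntax rule for \o can be modelled by a small parser. This is a sketch of the rule as stated above, not PCRE's implementation; the function name parse_o_escape is hypothetical, and rejecting empty braces is an assumption, since the text does not cover that case.

```python
def parse_o_escape(text):
    """Parse an escape of the form \\o{ddd..} at the start of text.

    Returns (code_point, remaining_text). Raises ValueError if the
    braces are missing or contain anything other than octal digits.
    """
    if not text.startswith("\\o{"):
        raise ValueError("\\o must be followed by {")
    end = text.find("}", 3)
    if end == -1:
        raise ValueError("missing terminating }")
    digits = text[3:end]
    # Assumption: an empty pair of braces is treated as an error.
    if not digits or any(d not in "01234567" for d in digits):
        raise ValueError("only octal digits may appear between \\o{ and }")
    return int(digits, 8), text[end + 1:]
```

For example, \o{1000} yields code point 512, which is greater than 0777 and so cannot be written with the three-digit octal form.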
378    <P>
379    For greater clarity and unambiguity, it is best to avoid following \ by a
380    digit greater than zero. Instead, use \o{} or \x{} to specify character
381    numbers, and \g{} to specify back references. The following paragraphs
382    describe the old, ambiguous syntax.
383    </P>
384    <P>
385    The handling of a backslash followed by a digit other than 0 is complicated,
386    and Perl has changed in recent releases, causing PCRE also to change. Outside a
387    character class, PCRE reads the digit and any following digits as a decimal
388    number. If the number is less than 8, or if there have been at least that many
389  previous capturing left parentheses in the expression, the entire sequence is  previous capturing left parentheses in the expression, the entire sequence is
390  taken as a <i>back reference</i>. A description of how this works is given  taken as a <i>back reference</i>. A description of how this works is given
391  <a href="#backreferences">later,</a>  <a href="#backreferences">later,</a>
# Line 400  following the discussion of Line 393  following the discussion of
393  <a href="#subpattern">parenthesized subpatterns.</a>  <a href="#subpattern">parenthesized subpatterns.</a>
394  </P>  </P>
395  <P>  <P>
396  Inside a character class, or if the decimal number is greater than 9 and there  Inside a character class, or if the decimal number following \ is greater than
397  have not been that many capturing subpatterns, PCRE re-reads up to three octal  7 and there have not been that many capturing subpatterns, PCRE handles \8 and
398  digits following the backslash, and uses them to generate a data character. Any  \9 as the literal characters "8" and "9", and otherwise re-reads up to three
399  subsequent digits stand for themselves. The value of the character is  octal digits following the backslash, using them to generate a data character.
400  constrained in the same way as characters specified in hexadecimal.  Any subsequent digits stand for themselves. For example:
 For example:  
401  <pre>  <pre>
402    \040   is another way of writing an ASCII space    \040   is another way of writing an ASCII space
403    \40    is the same, provided there are fewer than 40 previous capturing subpatterns    \40    is the same, provided there are fewer than 40 previous capturing subpatterns
# Line 415  For example: Line 407  For example:
407    \0113  is a tab followed by the character "3"    \0113  is a tab followed by the character "3"
408    \113   might be a back reference, otherwise the character with octal code 113    \113   might be a back reference, otherwise the character with octal code 113
409    \377   might be a back reference, otherwise the value 255 (decimal)    \377   might be a back reference, otherwise the value 255 (decimal)
410    \81    is either a back reference, or a binary zero followed by the two characters "8" and "1"    \81    is either a back reference, or the two characters "8" and "1"
411    </pre>
412    Note that octal values of 100 or greater that are specified using this syntax
413    must not be introduced by a leading zero, because no more than three octal
414    digits are ever read.
415    </P>
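The decision procedure for a backslash followed by digits, as described in the revised text, can be sketched in Python. The function name and return convention are hypothetical; the sketch follows the prose above rather than PCRE's source, and it ignores the mode-dependent limits on character values.

```python
OCTAL = "01234567"

def interpret_backslash_digits(digits, capture_count, in_class=False):
    """Classify the digits that follow a backslash.

    digits is the full run of decimal digits after the backslash;
    capture_count is the number of previous capturing left parentheses.
    Returns ("backref", n) or ("chars", literal_text).
    """
    if not digits or not all(d in "0123456789" for d in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] == "0":
        # \0 plus up to two further octal digits; the rest are literals.
        n = 1
        while n < 3 and n < len(digits) and digits[n] in OCTAL:
            n += 1
        return ("chars", chr(int(digits[:n], 8)) + digits[n:])
    if not in_class:
        number = int(digits)
        if number < 8 or number <= capture_count:
            return ("backref", number)
    if digits[0] in "89":
        # \8 and \9 are the literal characters "8" and "9".
        return ("chars", digits)
    # Re-read up to three octal digits; later digits stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL:
        n += 1
    return ("chars", chr(int(digits[:n], 8)) + digits[n:])
```

With no capturing groups, "113" gives the character with octal code 113 (the letter K), whereas with 113 or more groups it gives back reference 113.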
416    <P>
417    By default, after \x that is not followed by {, from zero to two hexadecimal
418    digits are read (letters can be in upper or lower case). Any number of
419    hexadecimal digits may appear between \x{ and }. If a character other than
420    a hexadecimal digit appears between \x{ and }, or if there is no terminating
421    }, an error occurs.
422    </P>
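The two forms of \x in the default (non-JavaScript) mode can be modelled in the same style. The function name parse_x_escape is hypothetical, and treating empty braces as an error is an assumption that the text does not confirm.

```python
HEX = "0123456789abcdefABCDEF"

def parse_x_escape(text):
    """Parse \\xhh or \\x{hhh..} at the start of text.

    Returns (code_point, remaining_text).
    """
    if not text.startswith("\\x"):
        raise ValueError("expected \\x")
    if text.startswith("\\x{"):
        end = text.find("}", 3)
        if end == -1:
            raise ValueError("missing terminating }")
        digits = text[3:end]
        # Assumption: an empty pair of braces is treated as an error.
        if not digits or any(c not in HEX for c in digits):
            raise ValueError("non-hexadecimal character between \\x{ and }")
        return int(digits, 16), text[end + 1:]
    # Basic form: read from zero to two hexadecimal digits.
    n = 2
    while n < 4 and n < len(text) and text[n] in HEX:
        n += 1
    digits = text[2:n]
    # With no digits at all, the result is a binary zero.
    return (int(digits, 16) if digits else 0), text[n:]
```

Both \xdc and \x{dc} yield the value 0xdc, and \x followed by a non-hexadecimal character yields a binary zero.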
423    <P>
424    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
425    as just described only when it is followed by two hexadecimal digits.
426    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
427    code points greater than 256 is provided by \u, which must be followed by
428    four hexadecimal digits; otherwise it matches a literal "u" character.
429    </P>
430    <P>
431    Characters whose value is less than 256 can be defined by either of the two
432    syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
433    way they are handled. For example, \xdc is exactly the same as \x{dc} (or
434    \u00dc in JavaScript mode).
435    </P>
436    <br><b>
437    Constraints on character values
438    </b><br>
439    <P>
440    Characters that are specified using octal or hexadecimal numbers are
441    limited to certain values, as follows:
442    <pre>
443      8-bit non-UTF mode    less than 0x100
444      8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
445      16-bit non-UTF mode   less than 0x10000
446      16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
447      32-bit non-UTF mode   less than 0x100000000
448      32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
449  </pre>  </pre>
450  Note that octal values of 100 or greater must not be introduced by a leading  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
451  zero, because no more than three octal digits are ever read.  "surrogate" codepoints), and 0xffef.
452  </P>  </P>
453    <br><b>
454    Escape sequences in character classes
455    </b><br>
456  <P>  <P>
457  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
458  and outside character classes. In addition, inside a character class, \b is  and outside character classes. In addition, inside a character class, \b is
# Line 498  matching point is at the end of the subj Line 531  matching point is at the end of the subj
531  there is no character to match.  there is no character to match.
532  </P>  </P>
533  <P>  <P>
534  For compatibility with Perl, \s does not match the VT character (code 11).  For compatibility with Perl, \s formerly did not match the VT character (code
535  This makes it different from the the POSIX "space" class. The \s characters  11), which made it different from the POSIX "space" class. However, Perl
536  are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is  added VT at release 5.18, and PCRE followed suit at release 8.34. The default
537  included in a Perl script, \s may match the VT character. In PCRE, it never  \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
538  does.  (32), which are defined as white space in the "C" locale. This list may vary if
539    locale-specific matching is taking place; in particular, in some locales the
540    "non-breaking space" character (\xA0) is recognized as white space.
541  </P>  </P>
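The default \s set listed above happens to coincide with what Python's <b>re</b> module matches for \s under its re.ASCII flag. The sketch below enumerates that set; it uses Python only as a convenient stand-in and says nothing about PCRE's behaviour in other locales.

```python
import re

# Collect every code below 128 that \s matches in ASCII-only mode.
space_codes = [
    code for code in range(128)
    if re.fullmatch(r"\s", chr(code), re.ASCII)
]

print(space_codes)  # HT, LF, VT, FF, CR, and space
```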
542  <P>  <P>
543  A "word" character is an underscore or any character that is a letter or digit.  A "word" character is an underscore or any character that is a letter or digit.
# Line 513  place (see Line 548  place (see
548  in the  in the
549  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
550  page). For example, in a French locale such as "fr_FR" in Unix-like systems,  page). For example, in a French locale such as "fr_FR" in Unix-like systems,
551  or "french" in Windows, some character codes greater than 128 are used for  or "french" in Windows, some character codes greater than 127 are used for
552  accented letters, and these are then matched by \w. The use of locales with  accented letters, and these are then matched by \w. The use of locales with
553  Unicode is discouraged.  Unicode is discouraged.
554  </P>  </P>
555  <P>  <P>
556  By default, in a UTF mode, characters with values greater than 128 never match  By default, characters whose code points are greater than 127 never match \d,
557  \d, \s, or \w, and always match \D, \S, and \W. These sequences retain  \s, or \w, and always match \D, \S, and \W, although this may vary for
558  their original meanings from before UTF support was available, mainly for  characters in the range 128-255 when locale-specific matching is happening.
559  efficiency reasons. However, if PCRE is compiled with Unicode property support,  These escape sequences retain their original meanings from before Unicode
560  and the PCRE_UCP option is set, the behaviour is changed so that Unicode  support was available, mainly for efficiency reasons. If PCRE is compiled with
561  properties are used to determine character types, as follows:  Unicode property support, and the PCRE_UCP option is set, the behaviour is
562  <pre>  changed so that Unicode properties are used to determine character types, as
563    \d  any character that \p{Nd} matches (decimal digit)  follows:
564    \s  any character that \p{Z} matches, plus HT, LF, FF, CR  <pre>
565    \w  any character that \p{L} or \p{N} matches, plus underscore    \d  any character that matches \p{Nd} (decimal digit)
566      \s  any character that matches \p{Z} or \h or \v
567      \w  any character that matches \p{L} or \p{N}, plus underscore
568  </pre>  </pre>
569  The upper case escapes match the inverse sets of characters. Note that \d  The upper case escapes match the inverse sets of characters. Note that \d
570  matches only decimal digits, whereas \w matches any Unicode digit, as well as  matches only decimal digits, whereas \w matches any Unicode digit, as well as
# Line 538  is noticeably slower when PCRE_UCP is se Line 575  is noticeably slower when PCRE_UCP is se
575  <P>  <P>
576  The sequences \h, \H, \v, and \V are features that were added to Perl at  The sequences \h, \H, \v, and \V are features that were added to Perl at
577  release 5.10. In contrast to the other sequences, which match only ASCII  release 5.10. In contrast to the other sequences, which match only ASCII
578  characters by default, these always match certain high-valued codepoints,  characters by default, these always match certain high-valued code points,
579  whether or not PCRE_UCP is set. The horizontal space characters are:  whether or not PCRE_UCP is set. The horizontal space characters are:
580  <pre>  <pre>
581    U+0009     Horizontal tab (HT)    U+0009     Horizontal tab (HT)
# Line 913  PCRE's additional properties Line 950  PCRE's additional properties
950  <P>  <P>
951  As well as the standard Unicode properties described above, PCRE supports four  As well as the standard Unicode properties described above, PCRE supports four
952  more that make it possible to convert traditional escape sequences such as \w  more that make it possible to convert traditional escape sequences such as \w
953  and \s and POSIX character classes to use Unicode properties. PCRE uses these  and \s to use Unicode properties. PCRE uses these non-standard, non-Perl
954  non-standard, non-Perl properties internally when PCRE_UCP is set. However,  properties internally when PCRE_UCP is set. However, they may also be used
955  they may also be used explicitly. These properties are:  explicitly. These properties are:
956  <pre>  <pre>
957    Xan   Any alphanumeric character    Xan   Any alphanumeric character
958    Xps   Any POSIX space character    Xps   Any POSIX space character
# Line 925  they may also be used explicitly. These Line 962  they may also be used explicitly. These
962  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
963  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
964  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
965  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps; it used to exclude vertical tab, for Perl
966  same characters as Xan, plus underscore.  compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd
967    matches the same characters as Xan, plus underscore.
968  </P>  </P>
969  <P>  <P>
970  There is another non-standard property, Xuc, which matches any character that  There is another non-standard property, Xuc, which matches any character that
# Line 1218  The minus (hyphen) character can be used Line 1256  The minus (hyphen) character can be used
1256  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
1257  inclusive. If a minus character is required in a class, it must be escaped with  inclusive. If a minus character is required in a class, it must be escaped with
1258  a backslash or appear in a position where it cannot be interpreted as  a backslash or appear in a position where it cannot be interpreted as
1259  indicating a range, typically as the first or last character in the class.  indicating a range, typically as the first or last character in the class, or
1260    immediately after a range. For example, [b-d-z] matches letters in the range b
1261    to d, a hyphen character, or z.
1262  </P>  </P>
1263  <P>  <P>
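A hyphen immediately after a range is treated the same way by Python's <b>re</b> module, which is used below as a stand-in for PCRE (an assumption made for the sake of a runnable example).

```python
import re

# The class holds the range b-d, a literal hyphen, and the letter z.
klass = re.compile(r"[b-d-z]")

matched = [ch for ch in "abcde-yz" if klass.fullmatch(ch)]
print(matched)
```

Only b, c, d, the hyphen, and z are matched; the letters between d and z are not.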
1264  It is not possible to have the literal character "]" as the end character of a  It is not possible to have the literal character "]" as the end character of a
# Line 1230  followed by two other characters. The oc Line 1270  followed by two other characters. The oc
1270  "]" can also be used to end a range.  "]" can also be used to end a range.
1271  </P>  </P>
1272  <P>  <P>
1273    An error is generated if a POSIX character class (see below) or an escape
1274    sequence other than one that defines a single character appears at a point
1275    where a range ending character is expected. For example, [z-\xff] is valid,
1276    but [A-\d] and [A-[:digit:]] are not.
1277    </P>
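Python's <b>re</b> module applies a comparable rule to range end points, so it can illustrate the distinction; this is a stand-in, not PCRE itself. The helper name compiles is hypothetical. The [A-[:digit:]] case is not shown because Python's re has no POSIX class syntax.

```python
import re

def compiles(pattern):
    """Report whether the pattern compiles without an error."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# \xff defines a single character, so it can end a range.
valid_range = compiles(r"[z-\xff]")
# \d defines a set of characters, so it cannot end a range.
invalid_range = compiles(r"[A-\d]")
```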
1278    <P>
1279  Ranges operate in the collating sequence of character values. They can also be  Ranges operate in the collating sequence of character values. They can also be
1280  used for characters specified numerically, for example [\000-\037]. Ranges  used for characters specified numerically, for example [\000-\037]. Ranges
1281  can include any characters that are valid for the current mode.  can include any characters that are valid for the current mode.
# Line 1294  are: Line 1340  are:
1340    lower    lower case letters    lower    lower case letters
1341    print    printing characters, including space    print    printing characters, including space
1342    punct    printing characters, excluding letters and digits and space    punct    printing characters, excluding letters and digits and space
1343    space    white space (not quite the same as \s)    space    white space (the same as \s from PCRE 8.34)
1344    upper    upper case letters    upper    upper case letters
1345    word     "word" characters (same as \w)    word     "word" characters (same as \w)
1346    xdigit   hexadecimal digits    xdigit   hexadecimal digits
1347  </pre>  </pre>
1348  The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and  The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1349  space (32). Notice that this list includes the VT character (code 11). This  and space (32). If locale-specific matching is taking place, there may be
1350  makes "space" different to \s, which does not include VT (for Perl  additional space characters. "Space" used to be different to \s, which did not
1351  compatibility).  include VT, for Perl compatibility. However, Perl changed at release 5.18, and
1352    PCRE followed at release 8.34. "Space" and \s now match the same set of
1353    characters.
1354  </P>  </P>
1355  <P>  <P>
1356  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
# Line 1316  syntax [.ch.] and [=ch=] where "ch" is a Line 1364  syntax [.ch.] and [=ch=] where "ch" is a
1364  supported, and an error is given if they are encountered.  supported, and an error is given if they are encountered.
1365  </P>  </P>
1366  <P>  <P>
1367  By default, in UTF modes, characters with values greater than 128 do not match  By default, characters with values greater than 128 do not match any of the
1368  any of the POSIX character classes. However, if the PCRE_UCP option is passed  POSIX character classes. However, if the PCRE_UCP option is passed to
1369  to <b>pcre_compile()</b>, some of the classes are changed so that Unicode  <b>pcre_compile()</b>, some of the classes are changed so that Unicode character
1370  character properties are used. This is achieved by replacing the POSIX classes  properties are used. This is achieved by replacing certain POSIX classes by
1371  by other sequences, as follows:  other sequences, as follows:
1372  <pre>  <pre>
1373    [:alnum:]  becomes  \p{Xan}    [:alnum:]  becomes  \p{Xan}
1374    [:alpha:]  becomes  \p{L}    [:alpha:]  becomes  \p{L}
# Line 1331  by other sequences, as follows: Line 1379  by other sequences, as follows:
1379    [:upper:]  becomes  \p{Lu}    [:upper:]  becomes  \p{Lu}
1380    [:word:]   becomes  \p{Xwd}    [:word:]   becomes  \p{Xwd}
1381  </pre>  </pre>
1382  Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX  Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX
1383  classes are unchanged, and match only characters with code points less than  classes are handled specially in UCP mode:
1384  128.  </P>
1385    <P>
1386    [:graph:]
1387    This matches characters that have glyphs that mark the page when printed. In
1388    Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1389    properties, except for:
1390    <pre>
1391      U+061C           Arabic Letter Mark
1392      U+180E           Mongolian Vowel Separator
1393      U+2066 - U+2069  Various "isolate"s
1394    
1395    </PRE>
1396    </P>
1397    <P>
1398    [:print:]
1399    This matches the same characters as [:graph:] plus space characters that are
1400    not controls, that is, characters with the Zs property.
1401    </P>
1402    <P>
1403    [:punct:]
1404    This matches all characters that have the Unicode P (punctuation) property,
1405    plus those characters whose code points are less than 128 that have the S
1406    (Symbol) property.
1407    </P>
1408    <P>
1409    The other POSIX classes are unchanged, and match only characters with code
1410    points less than 128.
1411  </P>  </P>
1412  <br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br>  <br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br>
1413  <P>  <P>
# Line 1535  and Line 1609  and
1609  can be made by name as well as by number.  can be made by name as well as by number.
1610  </P>  </P>
1611  <P>  <P>
1612  Names consist of up to 32 alphanumeric characters and underscores. Named  Names consist of up to 32 alphanumeric characters and underscores, but must
1613  capturing parentheses are still allocated numbers as well as names, exactly as  start with a non-digit. Named capturing parentheses are still allocated numbers
1614  if the names were not present. The PCRE API provides function calls for  as well as names, exactly as if the names were not present. The PCRE API
1615  extracting the name-to-number translation table from a compiled pattern. There  provides function calls for extracting the name-to-number translation table
1616  is also a convenience function for extracting a captured substring by name.  from a compiled pattern. There is also a convenience function for extracting a
1617    captured substring by name.
1618  </P>  </P>
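The naming rule can be expressed as a short validator. The function name is hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption; the text does not say how non-ASCII letters are treated.

```python
import re

# One non-digit name character, then up to 31 further name characters.
_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")

def is_valid_group_name(name):
    """Check a subpattern name against the rule described above."""
    return _NAME.fullmatch(name) is not None
```

Under this rule "n" and "_1" are accepted, while "1n" and any name longer than 32 characters are rejected.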
1619  <P>  <P>
1620  By default, a name must be unique within a pattern, but it is possible to relax  By default, a name must be unique within a pattern, but it is possible to relax
# Line 1568  matched. This saves searching to find wh Line 1643  matched. This saves searching to find wh
1643  </P>  </P>
1644  <P>  <P>
1645  If you make a back reference to a non-unique named subpattern from elsewhere in  If you make a back reference to a non-unique named subpattern from elsewhere in
1646  the pattern, the one that corresponds to the first occurrence of the name is  the pattern, the subpatterns to which the name refers are checked in the order
1647  used. In the absence of duplicate numbers (see the previous section) this is  in which they appear in the overall pattern. The first one that is set is used
1648  the one with the lowest number. If you use a named reference in a condition  for the reference. For example, this pattern matches both "foofoo" and
1649    "barbar" but not "foobar" or "barfoo":
1650    <pre>
1651      (?:(?&#60;n&#62;foo)|(?&#60;n&#62;bar))\k&#60;n&#62;
1652    
1653    </PRE>
1654    </P>
1655    <P>
1656    If you make a subroutine call to a non-unique named subpattern, the one that
1657    corresponds to the first occurrence of the name is used. In the absence of
1658    duplicate numbers (see the previous section) this is the one with the lowest
1659    number.
1660    </P>
1661    <P>
1662    If you use a named reference in a condition
1663  test (see the  test (see the
1664  <a href="#conditions">section about conditions</a>  <a href="#conditions">section about conditions</a>
1665  below), either to check whether a subpattern has matched, or to check for  below), either to check whether a subpattern has matched, or to check for
# Line 1585  documentation. Line 1674  documentation.
1674  <b>Warning:</b> You cannot use different names to distinguish between two  <b>Warning:</b> You cannot use different names to distinguish between two
1675  subpatterns with the same number because PCRE uses only the numbers when  subpatterns with the same number because PCRE uses only the numbers when
1676  matching. For this reason, an error is given at compile time if different names  matching. For this reason, an error is given at compile time if different names
1677  are given to subpatterns with the same number. However, you can give the same  are given to subpatterns with the same number. However, you can always give the
1678  name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.  same name to subpatterns with the same number, even when PCRE_DUPNAMES is not
1679    set.
1680  </P>  </P>
1681  <br><a name="SEC16" href="#TOC1">REPETITION</a><br>  <br><a name="SEC16" href="#TOC1">REPETITION</a><br>
1682  <P>  <P>
# Line 2252  Checking for a used subpattern by name Line 2342  Checking for a used subpattern by name
2342  <P>  <P>
2343  Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used  Perl uses the syntax (?(&#60;name&#62;)...) or (?('name')...) to test for a used
2344  subpattern by name. For compatibility with earlier versions of PCRE, which had  subpattern by name. For compatibility with earlier versions of PCRE, which had
2345  this facility before Perl, the syntax (?(name)...) is also recognized. However,  this facility before Perl, the syntax (?(name)...) is also recognized.
 there is a possible ambiguity with this syntax, because subpattern names may  
 consist entirely of digits. PCRE looks first for a named subpattern; if it  
 cannot find one and the name consists entirely of digits, PCRE looks for a  
 subpattern of that number, which must be greater than zero. Using subpattern  
 names that consist entirely of digits is not recommended.  
2346  </P>  </P>
2347  <P>  <P>
2348  Rewriting the above example to use a named subpattern gives this:  Rewriting the above example to use a named subpattern gives this:
# Line 2674  During matching, when PCRE reaches a cal Line 2759  During matching, when PCRE reaches a cal
2759  called. It is provided with the number of the callout, the position in the  called. It is provided with the number of the callout, the position in the
2760  pattern, and, optionally, one item of data originally supplied by the caller of  pattern, and, optionally, one item of data originally supplied by the caller of
2761  the matching function. The callout function may cause matching to proceed, to  the matching function. The callout function may cause matching to proceed, to
2762  backtrack, or to fail altogether. A complete description of the interface to  backtrack, or to fail altogether.
2763  the callout function is given in the  </P>
2764    <P>
2765    By default, PCRE implements a number of optimizations at compile time and
2766    matching time, and one side-effect is that sometimes callouts are skipped. If
2767    you need all possible callouts to happen, you need to set options that disable
2768    the relevant optimizations. More details, and a complete description of the
2769    interface to the callout function, are given in the
2770  <a href="pcrecallout.html"><b>pcrecallout</b></a>  <a href="pcrecallout.html"><b>pcrecallout</b></a>
2771  documentation.  documentation.
2772  <a name="backtrackcontrol"></a></P>  <a name="backtrackcontrol"></a></P>
# Line 3026  example: Line 3117  example:
3117  <pre>  <pre>
3118    ...(*COMMIT)(*PRUNE)...    ...(*COMMIT)(*PRUNE)...
3119  </pre>  </pre>
3120  If there is a matching failure to the right, backtracking onto (*PRUNE) cases  If there is a matching failure to the right, backtracking onto (*PRUNE) causes
3121  it to be triggered, and its action is taken. There can never be a backtrack  it to be triggered, and its action is taken. There can never be a backtrack
3122  onto (*COMMIT).  onto (*COMMIT).
3123  <a name="btrepeat"></a></P>  <a name="btrepeat"></a></P>
# Line 3109  Cambridge CB2 3QH, England. Line 3200  Cambridge CB2 3QH, England.
3200  </P>  </P>
3201  <br><a name="SEC29" href="#TOC1">REVISION</a><br>  <br><a name="SEC29" href="#TOC1">REVISION</a><br>
3202  <P>  <P>
3203  Last updated: 26 April 2013  Last updated: 12 November 2013
3204  <br>  <br>
3205  Copyright &copy; 1997-2013 University of Cambridge.  Copyright &copy; 1997-2013 University of Cambridge.
3206  <br>  <br>
