Diff of /code/trunk/doc/pcrepattern.3

revision 77 by nigel, Sat Feb 24 21:40:45 2007 UTC revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC
# Line 1  Line 1 
1  .TH PCRE 3  .TH PCREPATTERN 3
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PCRE REGULAR EXPRESSION DETAILS"  .SH "PCRE REGULAR EXPRESSION DETAILS"
# Line 96  The following sections describe the use Line 96  The following sections describe the use
96  .rs  .rs
97  .sp  .sp
98  The backslash character has several uses. Firstly, if it is followed by a  The backslash character has several uses. Firstly, if it is followed by a
99  non-alphanumeric character, it takes away any special meaning that character may  non-alphanumeric character, it takes away any special meaning that character
100  have. This use of backslash as an escape character applies both inside and  may have. This use of backslash as an escape character applies both inside and
101  outside character classes.  outside character classes.
102  .P  .P
103  For example, if you want to match a * character, you write \e* in the pattern.  For example, if you want to match a * character, you write \e* in the pattern.
# Line 108  particular, if you want to match a backs Line 108  particular, if you want to match a backs
108  .P  .P
109  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
110  pattern (other than in a character class) and characters between a # outside  pattern (other than in a character class) and characters between a # outside
111  a character class and the next newline character are ignored. An escaping  a character class and the next newline are ignored. An escaping backslash can
112  backslash can be used to include a whitespace or # character as part of the  be used to include a whitespace or # character as part of the pattern.
 pattern.  
113  .P  .P
114  If you want to remove the special meaning from a sequence of characters, you  If you want to remove the special meaning from a sequence of characters, you
115  can do so by putting them between \eQ and \eE. This is different from Perl in  can do so by putting them between \eQ and \eE. This is different from Perl in
# Line 148  represents: Line 147  represents:
147    \et        tab (hex 09)    \et        tab (hex 09)
148    \eddd      character with octal code ddd, or backreference    \eddd      character with octal code ddd, or backreference
149    \exhh      character with hex code hh    \exhh      character with hex code hh
150    \ex{hhh..} character with hex code hhh... (UTF-8 mode only)    \ex{hhh..} character with hex code hhh..
151  .sp  .sp
152  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
153  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 156  Thus \ecz becomes hex 1A, but \ec{ becom Line 155  Thus \ecz becomes hex 1A, but \ec{ becom
155  7B.  7B.
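The \ecx rule described above is plain bit arithmetic, so it can be checked by hand. The following is a minimal Python sketch of that rule only; the function name `control_escape_value` is made up for illustration and is not part of PCRE.

```python
def control_escape_value(x: str) -> int:
    """Model the rule for the control escape: an ASCII lower case letter is
    converted to upper case, then bit 6 (hex 40) of the code is inverted."""
    if len(x) != 1:
        raise ValueError("expected exactly one character after the escape")
    if "a" <= x <= "z":
        x = x.upper()
    return ord(x) ^ 0x40  # XOR with hex 40 inverts bit 6


if __name__ == "__main__":
    # 'z' is upper-cased to 'Z' (hex 5A); inverting bit 6 gives hex 1A.
    print(f"{control_escape_value('z'):02X}")
```

Because the bit is inverted rather than cleared, a character that already has bit 6 unset (such as ";", hex 3B) gains it instead of losing it.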
156  .P  .P
157  After \ex, from zero to two hexadecimal digits are read (letters can be in  After \ex, from zero to two hexadecimal digits are read (letters can be in
158  upper or lower case). In UTF-8 mode, any number of hexadecimal digits may  upper or lower case). Any number of hexadecimal digits may appear between \ex{
159  appear between \ex{ and }, but the value of the character code must be less  and }, but the value of the character code must be less than 256 in non-UTF-8
160  than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters  mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value
161  other than hexadecimal digits appear between \ex{ and }, or if there is no  is 7FFFFFFF). If characters other than hexadecimal digits appear between \ex{
162  terminating }, this form of escape is not recognized. Instead, the initial  and }, or if there is no terminating }, this form of escape is not recognized.
163  \ex will be interpreted as a basic hexadecimal escape, with no following  Instead, the initial \ex will be interpreted as a basic hexadecimal escape,
164  digits, giving a character whose value is zero.  with no following digits, giving a character whose value is zero.
165  .P  .P
166  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
167  syntaxes for \ex when PCRE is in UTF-8 mode. There is no difference in the  syntaxes for \ex. There is no difference in the way they are handled. For
168  way they are handled. For example, \exdc is exactly the same as \ex{dc}.  example, \exdc is exactly the same as \ex{dc}.
169  .P  .P
170  After \e0 up to two further octal digits are read. In both cases, if there  After \e0 up to two further octal digits are read. If there are fewer than two
171  are fewer than two digits, just those that are present are used. Thus the  digits, just those that are present are used. Thus the sequence \e0\ex\e07
172  sequence \e0\ex\e07 specifies two binary zeros followed by a BEL character  specifies two binary zeros followed by a BEL character (code value 7). Make
173  (code value 7). Make sure you supply two digits after the initial zero if the  sure you supply two digits after the initial zero if the pattern character that
174  pattern character that follows is itself an octal digit.  follows is itself an octal digit.
175  .P  .P
176  The handling of a backslash followed by a digit other than 0 is complicated.  The handling of a backslash followed by a digit other than 0 is complicated.
177  Outside a character class, PCRE reads it and any following digits as a decimal  Outside a character class, PCRE reads it and any following digits as a decimal
# Line 191  parenthesized subpatterns. Line 190  parenthesized subpatterns.
190  .P  .P
191  Inside a character class, or if the decimal number is greater than 9 and there  Inside a character class, or if the decimal number is greater than 9 and there
192  have not been that many capturing subpatterns, PCRE re-reads up to three octal  have not been that many capturing subpatterns, PCRE re-reads up to three octal
193  digits following the backslash, and generates a single byte from the least  digits following the backslash, and uses them to generate a data character. Any
194  significant 8 bits of the value. Any subsequent digits stand for themselves.  subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
195  For example:  character specified in octal must be less than \e400. In UTF-8 mode, values up
196    to \e777 are permitted. For example:
197  .sp  .sp
198    \e040   is another way of writing a space    \e040   is another way of writing a space
199  .\" JOIN  .\" JOIN
# Line 218  For example: Line 218  For example:
218  Note that octal values of 100 or greater must not be introduced by a leading  Note that octal values of 100 or greater must not be introduced by a leading
219  zero, because no more than three octal digits are ever read.  zero, because no more than three octal digits are ever read.
220  .P  .P
221  All the sequences that define a single byte value or a single UTF-8 character  All the sequences that define a single character value can be used both inside
222  (in UTF-8 mode) can be used both inside and outside character classes. In  and outside character classes. In addition, inside a character class, the
223  addition, inside a character class, the sequence \eb is interpreted as the  sequence \eb is interpreted as the backspace character (hex 08), and the
224  backspace character (hex 08), and the sequence \eX is interpreted as the  sequence \eX is interpreted as the character "X". Outside a character class,
225  character "X". Outside a character class, these sequences have different  these sequences have different meanings
 meanings  
226  .\" HTML <a href="#uniextseq">  .\" HTML <a href="#uniextseq">
227  .\" </a>  .\" </a>
228  (see below).  (see below).
# Line 253  there is no character to match. Line 252  there is no character to match.
252  .P  .P
253  For compatibility with Perl, \es does not match the VT character (code 11).  For compatibility with Perl, \es does not match the VT character (code 11).
254  This makes it different from the POSIX "space" class. The \es characters  This makes it different from the POSIX "space" class. The \es characters
255  are HT (9), LF (10), FF (12), CR (13), and space (32).  are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is
256    included in a Perl script, \es may match the VT character. In PCRE, it never
257    does.)
258  .P  .P
259  A "word" character is an underscore or any character less than 256 that is a  A "word" character is an underscore or any character less than 256 that is a
260  letter or digit. The definition of letters and digits is controlled by PCRE's  letter or digit. The definition of letters and digits is controlled by PCRE's
# Line 272  greater than 128 are used for accented l Line 273  greater than 128 are used for accented l
273  .P  .P
274  In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or  In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or
275  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode
276  character property support is available.  character property support is available. The use of locales with Unicode is
277    discouraged.
278  .  .
279  .  .
280  .\" HTML <a name="uniextseq"></a>  .\" HTML <a name="uniextseq"></a>
# Line 280  character property support is available. Line 282  character property support is available.
282  .rs  .rs
283  .sp  .sp
284  When PCRE is built with Unicode character property support, three additional  When PCRE is built with Unicode character property support, three additional
285  escape sequences to match generic character types are available when UTF-8 mode  escape sequences to match character properties are available when UTF-8 mode
286  is selected. They are:  is selected. They are:
287  .sp  .sp
288   \ep{\fIxx\fP}   a character with the \fIxx\fP property    \ep{\fIxx\fP}   a character with the \fIxx\fP property
289   \eP{\fIxx\fP}   a character without the \fIxx\fP property    \eP{\fIxx\fP}   a character without the \fIxx\fP property
290   \eX       an extended Unicode sequence    \eX       an extended Unicode sequence
291  .sp  .sp
292  The property names represented by \fIxx\fP above are limited to the  The property names represented by \fIxx\fP above are limited to the Unicode
293  Unicode general category properties. Each character has exactly one such  script names, the general category properties, and "Any", which matches any
294  property, specified by a two-letter abbreviation. For compatibility with Perl,  character (including newline). Other properties such as "InMusicalSymbols" are
295  negation can be specified by including a circumflex between the opening brace  not currently supported by PCRE. Note that \eP{Any} does not match any
296  and the property name. For example, \ep{^Lu} is the same as \eP{Lu}.  characters, so always causes a match failure.
297  .P  .P
298  If only one letter is specified with \ep or \eP, it includes all the properties  Sets of Unicode characters are defined as belonging to certain scripts. A
299  that start with that letter. In this case, in the absence of negation, the  character from one of these sets can be matched using a script name. For
300  curly brackets in the escape sequence are optional; these two examples have  example:
301  the same effect:  .sp
302      \ep{Greek}
303      \eP{Han}
304    .sp
305    Those that are not part of an identified script are lumped together as
306    "Common". The current list of scripts is:
307    .P
308    Arabic,
309    Armenian,
310    Bengali,
311    Bopomofo,
312    Braille,
313    Buginese,
314    Buhid,
315    Canadian_Aboriginal,
316    Cherokee,
317    Common,
318    Coptic,
319    Cypriot,
320    Cyrillic,
321    Deseret,
322    Devanagari,
323    Ethiopic,
324    Georgian,
325    Glagolitic,
326    Gothic,
327    Greek,
328    Gujarati,
329    Gurmukhi,
330    Han,
331    Hangul,
332    Hanunoo,
333    Hebrew,
334    Hiragana,
335    Inherited,
336    Kannada,
337    Katakana,
338    Kharoshthi,
339    Khmer,
340    Lao,
341    Latin,
342    Limbu,
343    Linear_B,
344    Malayalam,
345    Mongolian,
346    Myanmar,
347    New_Tai_Lue,
348    Ogham,
349    Old_Italic,
350    Old_Persian,
351    Oriya,
352    Osmanya,
353    Runic,
354    Shavian,
355    Sinhala,
356    Syloti_Nagri,
357    Syriac,
358    Tagalog,
359    Tagbanwa,
360    Tai_Le,
361    Tamil,
362    Telugu,
363    Thaana,
364    Thai,
365    Tibetan,
366    Tifinagh,
367    Ugaritic,
368    Yi.
369    .P
370    Each character has exactly one general category property, specified by a
371    two-letter abbreviation. For compatibility with Perl, negation can be specified
372    by including a circumflex between the opening brace and the property name. For
373    example, \ep{^Lu} is the same as \eP{Lu}.
374    .P
375    If only one letter is specified with \ep or \eP, it includes all the general
376    category properties that start with that letter. In this case, in the absence
377    of negation, the curly brackets in the escape sequence are optional; these two
378    examples have the same effect:
379  .sp  .sp
380    \ep{L}    \ep{L}
381    \epL    \epL
382  .sp  .sp
383  The following property codes are supported:  The following general category property codes are supported:
384  .sp  .sp
385    C     Other    C     Other
386    Cc    Control    Cc    Control
# Line 347  The following property codes are support Line 426  The following property codes are support
426    Zp    Paragraph separator    Zp    Paragraph separator
427    Zs    Space separator    Zs    Space separator
428  .sp  .sp
429  Extended properties such as "Greek" or "InMusicalSymbols" are not supported by  The special property L& is also supported: it matches a character that has
430  PCRE.  the Lu, Ll, or Lt property, in other words, a letter that is not classified as
431    a modifier or "other".
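As a rough illustration of what the L& property covers, the set of general categories it stands for can be checked against Python's `unicodedata` module. This is a sketch that uses Python's Unicode tables, which may be a different Unicode version from the one PCRE was built with; the helper name `has_l_ampersand` is hypothetical.

```python
import unicodedata

# L& is described as covering exactly these three general categories.
CASED_LETTER_CATEGORIES = {"Lu", "Ll", "Lt"}


def has_l_ampersand(ch: str) -> bool:
    """Return True if ch is an upper case, lower case, or title case letter."""
    return unicodedata.category(ch) in CASED_LETTER_CATEGORIES


if __name__ == "__main__":
    for ch in ["A", "a", "\u01c5", "\u02b0", "\u05d0", "1"]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), has_l_ampersand(ch))
```

Modifier letters (Lm) and other letters (Lo) are letters in the single-letter L sense but fall outside L&.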
432    .P
433    The long synonyms for these properties that Perl supports (such as \ep{Letter})
434    are not supported by PCRE, nor is it permitted to prefix any of these
435    properties with "Is".
436    .P
437    No character that is in the Unicode table has the Cn (unassigned) property.
438    Instead, this property is assumed for any code point that is not in the
439    Unicode table.
440  .P  .P
441  Specifying caseless matching does not affect these escape sequences. For  Specifying caseless matching does not affect these escape sequences. For
442  example, \ep{Lu} always matches only upper case letters.  example, \ep{Lu} always matches only upper case letters.
# Line 386  subpatterns for more complicated asserti Line 474  subpatterns for more complicated asserti
474  .\" </a>  .\" </a>
475  below.  below.
476  .\"  .\"
477  The backslashed  The backslashed assertions are:
 assertions are:  
478  .sp  .sp
479    \eb     matches at a word boundary    \eb     matches at a word boundary
480    \eB     matches when not at a word boundary    \eB     matches when not at a word boundary
# Line 412  PCRE_NOTBOL or PCRE_NOTEOL options, whic Line 499  PCRE_NOTBOL or PCRE_NOTEOL options, whic
499  circumflex and dollar metacharacters. However, if the \fIstartoffset\fP  circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
500  argument of \fBpcre_exec()\fP is non-zero, indicating that matching is to start  argument of \fBpcre_exec()\fP is non-zero, indicating that matching is to start
501  at a point other than the beginning of the subject, \eA can never match. The  at a point other than the beginning of the subject, \eA can never match. The
502  difference between \eZ and \ez is that \eZ matches before a newline that is the  difference between \eZ and \ez is that \eZ matches before a newline at the end
503  last character of the string as well as at the end of the string, whereas \ez  of the string as well as at the very end, whereas \ez matches only at the end.
 matches only at the end.  
504  .P  .P
505  The \eG assertion is true only when the current matching position is at the  The \eG assertion is true only when the current matching position is at the
506  start point of the match, as specified by the \fIstartoffset\fP argument of  start point of the match, as specified by the \fIstartoffset\fP argument of
# Line 458  to be anchored.) Line 544  to be anchored.)
544  .P  .P
545  A dollar character is an assertion that is true only if the current matching  A dollar character is an assertion that is true only if the current matching
546  point is at the end of the subject string, or immediately before a newline  point is at the end of the subject string, or immediately before a newline
547  character that is the last character in the string (by default). Dollar need  at the end of the string (by default). Dollar need not be the last character of
548  not be the last character of the pattern if a number of alternatives are  the pattern if a number of alternatives are involved, but it should be the last
549  involved, but it should be the last item in any branch in which it appears.  item in any branch in which it appears. Dollar has no special meaning in a
550  Dollar has no special meaning in a character class.  character class.
551  .P  .P
552  The meaning of dollar can be changed so that it matches only at the very end of  The meaning of dollar can be changed so that it matches only at the very end of
553  the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This  the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
554  does not affect the \eZ assertion.  does not affect the \eZ assertion.
555  .P  .P
556  The meanings of the circumflex and dollar characters are changed if the  The meanings of the circumflex and dollar characters are changed if the
557  PCRE_MULTILINE option is set. When this is the case, they match immediately  PCRE_MULTILINE option is set. When this is the case, a circumflex matches
558  after and immediately before an internal newline character, respectively, in  immediately after internal newlines as well as at the start of the subject
559  addition to matching at the start and end of the subject string. For example,  string. It does not match after a newline that ends the string. A dollar
560  the pattern /^abc$/ matches the subject string "def\enabc" (where \en  matches before any newlines in the string, as well as at the very end, when
561  represents a newline character) in multiline mode, but not otherwise.  PCRE_MULTILINE is set. When newline is specified as the two-character
562  Consequently, patterns that are anchored in single line mode because all  sequence CRLF, isolated CR and LF characters do not indicate newlines.
563  branches start with ^ are not anchored in multiline mode, and a match for  .P
564  circumflex is possible when the \fIstartoffset\fP argument of \fBpcre_exec()\fP  For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
565  is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is  \en represents a newline) in multiline mode, but not otherwise. Consequently,
566  set.  patterns that are anchored in single line mode because all branches start with
567    ^ are not anchored in multiline mode, and a match for circumflex is possible
568    when the \fIstartoffset\fP argument of \fBpcre_exec()\fP is non-zero. The
569    PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
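The /^abc$/ example can be tried out with Python's `re` module, whose `re.MULTILINE` flag plays a similar role to PCRE_MULTILINE for this particular case. This is an analogy only: Python's `re` is not PCRE and differs in other details, such as how a trailing newline and CRLF line endings are handled.

```python
import re

subject = "def\nabc"

# Without the multiline flag, ^ can match only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# With the flag, ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

if __name__ == "__main__":
    print(single_line is None)
    print(multi_line.group(0), multi_line.start())
```

The multiline match starts at offset 4, just after the newline, which is why a pattern whose branches all start with ^ is no longer anchored to the start of the subject in this mode.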
570  .P  .P
571  Note that the sequences \eA, \eZ, and \ez can be used to match the start and  Note that the sequences \eA, \eZ, and \ez can be used to match the start and
572  end of the subject in both modes, and if all branches of a pattern start with  end of the subject in both modes, and if all branches of a pattern start with
573  \eA it is always anchored, whether PCRE_MULTILINE is set or not.  \eA it is always anchored, whether or not PCRE_MULTILINE is set.
574  .  .
575  .  .
576  .SH "FULL STOP (PERIOD, DOT)"  .SH "FULL STOP (PERIOD, DOT)"
577  .rs  .rs
578  .sp  .sp
579  Outside a character class, a dot in the pattern matches any one character in  Outside a character class, a dot in the pattern matches any one character in
580  the subject, including a non-printing character, but not (by default) newline.  the subject string except (by default) a character that signifies the end of a
581  In UTF-8 mode, a dot matches any UTF-8 character, which might be more than one  line. In UTF-8 mode, the matched character may be more than one byte long. When
582  byte long, except (by default) newline. If the PCRE_DOTALL option is set,  a line ending is defined as a single character (CR or LF), dot never matches
583  dots match newlines as well. The handling of dot is entirely independent of the  that character; when the two-character sequence CRLF is used, dot does not
584  handling of circumflex and dollar, the only relationship being that they both  match CR if it is immediately followed by LF, but otherwise it matches all
585  involve newline characters. Dot has no special meaning in a character class.  characters (including isolated CRs and LFs).
586    .P
587    The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
588    option is set, a dot matches any one character, without exception. If newline
589    is defined as the two-character sequence CRLF, it takes two dots to match it.
590    .P
591    The handling of dot is entirely independent of the handling of circumflex and
592    dollar, the only relationship being that they both involve newlines. Dot has no
593    special meaning in a character class.
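The dot rules above can be summarised as a small decision function. The following Python sketch models only the rules as stated in the text, working on characters rather than bytes; it is not PCRE code, and the function `dot_matches_at` and its parameters are invented for illustration.

```python
def dot_matches_at(subject: str, pos: int, newline: str = "\n",
                   dotall: bool = False) -> bool:
    """Model whether a dot matches the character at subject[pos].

    newline is the configured line ending: "\r", "\n", or "\r\n".
    dotall models the PCRE_DOTALL option.
    """
    if pos >= len(subject):
        return False  # no character left to match
    if dotall:
        return True  # any one character, without exception
    if len(newline) == 1:
        return subject[pos] != newline  # never matches the newline character
    # Two-character ending: only a CR immediately followed by LF is refused.
    return subject[pos:pos + len(newline)] != newline


if __name__ == "__main__":
    text = "a\r\nb"
    print([dot_matches_at(text, i, newline="\r\n") for i in range(len(text))])
```

With `dotall=True` and a CRLF line ending, both the CR and the LF positions match, which corresponds to the statement that it takes two dots to match the sequence.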
594  .  .
595  .  .
596  .SH "MATCHING A SINGLE BYTE"  .SH "MATCHING A SINGLE BYTE"
597  .rs  .rs
598  .sp  .sp
599  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
600  in and out of UTF-8 mode. Unlike a dot, it can match a newline. The feature is  in and out of UTF-8 mode. Unlike a dot, it always matches CR and LF. The
601  provided in Perl in order to match individual bytes in UTF-8 mode. Because it  feature is provided in Perl in order to match individual bytes in UTF-8 mode.
602  breaks up UTF-8 characters into individual bytes, what remains in the string  Because it breaks up UTF-8 characters into individual bytes, what remains in
603  may be a malformed UTF-8 string. For this reason, the \eC escape sequence is  the string may be a malformed UTF-8 string. For this reason, the \eC escape
604  best avoided.  sequence is best avoided.
605  .P  .P
606  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
607  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
# Line 555  If you want to use caseless matching for Line 652  If you want to use caseless matching for
652  ensure that PCRE is compiled with Unicode property support as well as with  ensure that PCRE is compiled with Unicode property support as well as with
653  UTF-8 support.  UTF-8 support.
654  .P  .P
655  The newline character is never treated in any special way in character classes,  Characters that might indicate line breaks (CR and LF) are never treated in any
656  whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class  special way when matching character classes, whatever line-ending sequence is
657  such as [^a] will always match a newline.  in use, and whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is
658    used. A class such as [^a] always matches one of these characters.
659  .P  .P
660  The minus (hyphen) character can be used to specify a range of characters in a  The minus (hyphen) character can be used to specify a range of characters in a
661  character class. For example, [d-m] matches any letter between d and m,  character class. For example, [d-m] matches any letter between d and m,
# Line 656  the pattern Line 754  the pattern
754    gilbert|sullivan    gilbert|sullivan
755  .sp  .sp
756  matches either "gilbert" or "sullivan". Any number of alternatives may appear,  matches either "gilbert" or "sullivan". Any number of alternatives may appear,
757  and an empty alternative is permitted (matching the empty string).  and an empty alternative is permitted (matching the empty string). The matching
758  The matching process tries each alternative in turn, from left to right,  process tries each alternative in turn, from left to right, and the first one
759  and the first one that succeeds is used. If the alternatives are within a  that succeeds is used. If the alternatives are within a subpattern
 subpattern  
760  .\" HTML <a href="#subpattern">  .\" HTML <a href="#subpattern">
761  .\" </a>  .\" </a>
762  (defined below),  (defined below),
# Line 710  branch is abandoned before the option se Line 807  branch is abandoned before the option se
807  option settings happen at compile time. There would be some very weird  option settings happen at compile time. There would be some very weird
808  behaviour otherwise.  behaviour otherwise.
809  .P  .P
810  The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the  The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
811  same way as the Perl-compatible options by using the characters U and X  changed in the same way as the Perl-compatible options by using the characters
812  respectively. The (?X) flag setting is special in that it must always occur  J, U and X respectively.
 earlier in the pattern than any of the additional features it turns on, even  
 when it is at top level. It is best to put it at the start.  
813  .  .
814  .  .
815  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
# Line 777  Identifying capturing parentheses by num Line 872  Identifying capturing parentheses by num
872  to keep track of the numbers in complicated regular expressions. Furthermore,  to keep track of the numbers in complicated regular expressions. Furthermore,
873  if an expression is modified, the numbers may change. To help with this  if an expression is modified, the numbers may change. To help with this
874  difficulty, PCRE supports the naming of subpatterns, something that Perl does  difficulty, PCRE supports the naming of subpatterns, something that Perl does
875  not provide. The Python syntax (?P<name>...) is used. Names consist of  not provide. The Python syntax (?P<name>...) is used. References to capturing
876  alphanumeric characters and underscores, and must be unique within a pattern.  parentheses from other parts of the pattern, such as
877    .\" HTML <a href="#backreferences">
878    .\" </a>
879    backreferences,
880    .\"
881    .\" HTML <a href="#recursion">
882    .\" </a>
883    recursion,
884    .\"
885    and
886    .\" HTML <a href="#conditions">
887    .\" </a>
888    conditions,
889    .\"
890    can be made by name as well as by number.
891  .P  .P
892  Named capturing parentheses are still allocated numbers as well as names. The  Names consist of up to 32 alphanumeric characters and underscores. Named
893  PCRE API provides function calls for extracting the name-to-number translation  capturing parentheses are still allocated numbers as well as names. The PCRE
894  table from a compiled pattern. There is also a convenience function for  API provides function calls for extracting the name-to-number translation table
895  extracting a captured substring by name. For further details see the  from a compiled pattern. There is also a convenience function for extracting a
896    captured substring by name.
897    .P
898    By default, a name must be unique within a pattern, but it is possible to relax
899    this constraint by setting the PCRE_DUPNAMES option at compile time. This can
900    be useful for patterns where only one instance of the named parentheses can
901    match. Suppose you want to match the name of a weekday, either as a 3-letter
902    abbreviation or as the full name, and in both cases you want to extract the
903    abbreviation. This pattern (ignoring the line breaks) does the job:
904    .sp
905      (?P<DN>Mon|Fri|Sun)(?:day)?|
906      (?P<DN>Tue)(?:sday)?|
907      (?P<DN>Wed)(?:nesday)?|
908      (?P<DN>Thu)(?:rsday)?|
909      (?P<DN>Sat)(?:urday)?
910    .sp
911    There are five capturing substrings, but only one is ever set after a match.
912    The convenience function for extracting the data by name returns the substring
913    for the first, and in this example, the only, subpattern of that name that
914    matched. This saves searching to find which numbered subpattern it was. If you
915    make a reference to a non-unique named subpattern from elsewhere in the
916    pattern, the one that corresponds to the lowest number is used. For further
917    details of the interfaces for handling named subpatterns, see the
918  .\" HREF  .\" HREF
919  \fBpcreapi\fP  \fBpcreapi\fP
920  .\"  .\"
# Line 987  option is ignored. They are a convenient Line 1118  option is ignored. They are a convenient
1118  atomic group. However, there is no difference in the meaning or processing of a  atomic group. However, there is no difference in the meaning or processing of a
1119  possessive quantifier and the equivalent atomic group.  possessive quantifier and the equivalent atomic group.
1120  .P  .P
1121  The possessive quantifier syntax is an extension to the Perl syntax. It  The possessive quantifier syntax is an extension to the Perl syntax. Jeffrey
1122  originates in Sun's Java package.  Friedl originated the idea (and the name) in the first edition of his book.
1123    Mike McCloskey liked it, so implemented it when he built Sun's Java package,
1124    and PCRE copied it from there.
1125  .P  .P
1126  When a pattern contains an unlimited repeat inside a subpattern that can itself  When a pattern contains an unlimited repeat inside a subpattern that can itself
1127  be repeated an unlimited number of times, the use of an atomic group is the  be repeated an unlimited number of times, the use of an atomic group is the
# Line 1030  However, if the decimal number following Line 1163  However, if the decimal number following
1163  always taken as a back reference, and causes an error only if there are not  always taken as a back reference, and causes an error only if there are not
1164  that many capturing left parentheses in the entire pattern. In other words, the  that many capturing left parentheses in the entire pattern. In other words, the
1165  parentheses that are referenced need not be to the left of the reference for  parentheses that are referenced need not be to the left of the reference for
1166  numbers less than 10. See the subsection entitled "Non-printing characters"  numbers less than 10. A "forward back reference" of this type can make sense
1167    when a repetition is involved and the subpattern to the right has participated
1168    in an earlier iteration.
1169    .P
1170    It is not possible to have a numerical "forward back reference" to a subpattern
1171    whose number is 10 or more. However, a back reference to any subpattern is
1172    possible using named parentheses (see below). See also the subsection entitled
1173    "Non-printing characters"
1174  .\" HTML <a href="#digitsafterbackslash">  .\" HTML <a href="#digitsafterbackslash">
1175  .\" </a>  .\" </a>
1176  above  above
# Line 1060  capturing subpattern is matched caseless Line 1200  capturing subpattern is matched caseless
1200  Back references to named subpatterns use the Python syntax (?P=name). We could  Back references to named subpatterns use the Python syntax (?P=name). We could
1201  rewrite the above example as follows:  rewrite the above example as follows:
1202  .sp  .sp
1203    (?<p1>(?i)rah)\es+(?P=p1)    (?P<p1>(?i)rah)\es+(?P=p1)
1204  .sp  .sp
1205    A subpattern that is referenced by name may appear in the pattern before or
1206    after the reference.
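
The named back reference syntax can be tried outside PCRE, because it was borrowed from Python. The following is a minimal sketch using Python's standard re module rather than PCRE itself; the names DOUBLED and is_doubled are illustrative only. Two differences are assumptions of the sketch: the inline (?i) from the example is omitted, since recent Python versions reject a global flag in mid-pattern, and Python requires the named group to appear before the reference, so the forward references described here cannot be shown this way.

```python
import re

# (?P<p1>...) names the group; (?P=p1) must match the same text again.
DOUBLED = re.compile(r"(?P<p1>rah)\s+(?P=p1)")


def is_doubled(text):
    """Return True if text is the captured word, white space, then the same word."""
    return DOUBLED.fullmatch(text) is not None


if __name__ == "__main__":
    print(is_doubled("rah rah"))
    print(is_doubled("rah bah"))
```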
1207    .P
1208  There may be more than one back reference to the same subpattern. If a  There may be more than one back reference to the same subpattern. If a
1209  subpattern has not actually been used in a particular match, any back  subpattern has not actually been used in a particular match, any back
1210  references to it always fail. For example, the pattern  references to it always fail. For example, the pattern
# Line 1123  because it does not make sense for negat Line 1266  because it does not make sense for negat
1266  .SS "Lookahead assertions"  .SS "Lookahead assertions"
1267  .rs  .rs
1268  .sp  .sp
1269  Lookahead assertions start  Lookahead assertions start with (?= for positive assertions and (?! for
1270  with (?= for positive assertions and (?! for negative assertions. For example,  negative assertions. For example,
1271  .sp  .sp
1272    \ew+(?=;)    \ew+(?=;)
1273  .sp  .sp
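
As an illustration of the positive lookahead pattern shown above, here is a small sketch using Python's re module, which accepts the same (?= syntax. The function name words_before_semicolon is hypothetical. The point it shows is that the semicolon is tested but not included in the matched text.

```python
import re


def words_before_semicolon(text):
    """Return every word that is immediately followed by a semicolon."""
    # (?=;) asserts that a semicolon comes next without consuming it.
    return re.findall(r"\w+(?=;)", text)


if __name__ == "__main__":
    print(words_before_semicolon("foo;bar baz;"))
```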
# Line 1159  negative assertions. For example, Line 1302  negative assertions. For example,
1302  .sp  .sp
1303  does find an occurrence of "bar" that is not preceded by "foo". The contents of  does find an occurrence of "bar" that is not preceded by "foo". The contents of
1304  a lookbehind assertion are restricted such that all the strings it matches must  a lookbehind assertion are restricted such that all the strings it matches must
1305  have a fixed length. However, if there are several alternatives, they do not  have a fixed length. However, if there are several top-level alternatives, they
1306  all have to have the same fixed length. Thus  do not all have to have the same fixed length. Thus
1307  .sp  .sp
1308    (?<=bullock|donkey)    (?<=bullock|donkey)
1309  .sp  .sp
# Line 1254  is another pattern that matches "foo" pr Line 1397  is another pattern that matches "foo" pr
1397  characters that are not "999".  characters that are not "999".
1398  .  .
1399  .  .
1400    .\" HTML <a name="conditions"></a>
1401  .SH "CONDITIONAL SUBPATTERNS"  .SH "CONDITIONAL SUBPATTERNS"
1402  .rs  .rs
1403  .sp  .sp
# Line 1270  no-pattern (if present) is used. If ther Line 1414  no-pattern (if present) is used. If ther
1414  subpattern, a compile-time error occurs.  subpattern, a compile-time error occurs.
1415  .P  .P
1416  There are three kinds of condition. If the text between the parentheses  There are three kinds of condition. If the text between the parentheses
1417  consists of a sequence of digits, the condition is satisfied if the capturing  consists of a sequence of digits, or a sequence of alphanumeric characters and
1418  subpattern of that number has previously matched. The number must be greater  underscores, the condition is satisfied if the capturing subpattern of that
1419  than zero. Consider the following pattern, which contains non-significant white  number or name has previously matched. There is a possible ambiguity here,
1420  space to make it more readable (assume the PCRE_EXTENDED option) and to divide  because subpattern names may consist entirely of digits. PCRE looks first for a
1421  it into three parts for ease of discussion:  named subpattern; if it cannot find one and the text consists entirely of
1422    digits, it looks for a subpattern of that number, which must be greater than
1423    zero. Using subpattern names that consist entirely of digits is not
1424    recommended.
1425    .P
1426    Consider the following pattern, which contains non-significant white space to
1427    make it more readable (assume the PCRE_EXTENDED option) and to divide it into
1428    three parts for ease of discussion:
1429  .sp  .sp
1430    ( \e( )?    [^()]+    (?(1) \e) )    ( \e( )?    [^()]+    (?(1) \e) )
1431  .sp  .sp
# Line 1286  or not. If they did, that is, if subject Line 1437  or not. If they did, that is, if subject
1437  the condition is true, and so the yes-pattern is executed and a closing  the condition is true, and so the yes-pattern is executed and a closing
1438  parenthesis is required. Otherwise, since no-pattern is not present, the  parenthesis is required. Otherwise, since no-pattern is not present, the
1439  subpattern matches nothing. In other words, this pattern matches a sequence of  subpattern matches nothing. In other words, this pattern matches a sequence of
1440  non-parentheses, optionally enclosed in parentheses.  non-parentheses, optionally enclosed in parentheses. Rewriting it to use a
1441  .P  named subpattern gives this:
1442  If the condition is the string (R), it is satisfied if a recursive call to the  .sp
1443  pattern or subpattern has been made. At "top level", the condition is false.    (?P<OPEN> \e( )?    [^()]+    (?(OPEN) \e) )
1444  This is a PCRE extension. Recursive patterns are described in the next section.  .sp
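
The numbered and named forms of this conditional pattern can be compared side by side. The sketch below uses Python's re module, which supports the same (?(1)...) and (?(name)...) condition syntax. It is not PCRE, and the names NUMBERED, NAMED and accepts are illustrative. The white space from the PCRE_EXTENDED layout is dropped so that no extra flag is needed.

```python
import re

# Condition on capturing subpattern number 1.
NUMBERED = re.compile(r"(\()?[^()]+(?(1)\))")
# The same pattern with the condition on a named subpattern.
NAMED = re.compile(r"(?P<OPEN>\()?[^()]+(?(OPEN)\))")


def accepts(pattern, text):
    """Return True if the whole of text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("abc", "(abc)", "(abc", "abc)"):
        print(sample, accepts(NUMBERED, sample), accepts(NAMED, sample))
```

A closing parenthesis is required only when the opening one was matched, so both patterns accept "abc" and "(abc)" and reject the unbalanced forms.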
1445    If the condition is the string (R), and there is no subpattern with the name R,
1446    the condition is satisfied if a recursive call to the pattern or subpattern has
1447    been made. At "top level", the condition is false. This is a PCRE extension.
1448    Recursive patterns are described in the next section.
1449  .P  .P
1450  If the condition is not a sequence of digits or (R), it must be an assertion.  If the condition is not a sequence of digits or (R), it must be an assertion.
1451  This may be a positive or negative lookahead or lookbehind assertion. Consider  This may be a positive or negative lookahead or lookbehind assertion. Consider
# Line 1317  closing parenthesis. Nested parentheses Line 1472  closing parenthesis. Nested parentheses
1472  that make up a comment play no part in the pattern matching at all.  that make up a comment play no part in the pattern matching at all.
1473  .P  .P
1474  If the PCRE_EXTENDED option is set, an unescaped # character outside a  If the PCRE_EXTENDED option is set, an unescaped # character outside a
1475  character class introduces a comment that continues up to the next newline  character class introduces a comment that continues to immediately after the
1476  character in the pattern.  next newline in the pattern.
1477  .  .
1478  .  .
1479    .\" HTML <a name="recursion"></a>
1480  .SH "RECURSIVE PATTERNS"  .SH "RECURSIVE PATTERNS"
1481  .rs  .rs
1482  .sp  .sp
# Line 1346  number, provided that it occurs inside t Line 1502  number, provided that it occurs inside t
1502  "subroutine" call, which is described in the next section.) The special item  "subroutine" call, which is described in the next section.) The special item
1503  (?R) is a recursive call of the entire regular expression.  (?R) is a recursive call of the entire regular expression.
1504  .P  .P
1505  For example, this PCRE pattern solves the nested parentheses problem (assume  A recursive subpattern call is always treated as an atomic group. That is, once
1506  the PCRE_EXTENDED option is set so that white space is ignored):  it has matched some of the subject string, it is never re-entered, even if
1507    it contains untried alternatives and there is a subsequent matching failure.
1508    .P
1509    This PCRE pattern solves the nested parentheses problem (assume the
1510    PCRE_EXTENDED option is set so that white space is ignored):
1511  .sp  .sp
1512    \e( ( (?>[^()]+) | (?R) )* \e)    \e( ( (?>[^()]+) | (?R) )* \e)
1513  .sp  .sp
1514  First it matches an opening parenthesis. Then it matches any number of  First it matches an opening parenthesis. Then it matches any number of
1515  substrings which can either be a sequence of non-parentheses, or a recursive  substrings which can either be a sequence of non-parentheses, or a recursive
1516  match of the pattern itself (that is a correctly parenthesized substring).  match of the pattern itself (that is, a correctly parenthesized substring).
1517  Finally there is a closing parenthesis.  Finally there is a closing parenthesis.
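
Python's standard re module has no equivalent of (?R), so the three steps just described are sketched below as a hand-written recursive function instead of a pattern. This is a substitute technique that mirrors the structure of the pattern, not PCRE's implementation; the name match_balanced is hypothetical.

```python
def match_balanced(text, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # First, an opening parenthesis is required.
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    while pos < len(text):
        if text[pos] == "(":
            # A nested group: recurse, as (?R) does.
            end = match_balanced(text, pos)
            if end is None:
                return None
            pos = end
        elif text[pos] == ")":
            # Finally, the closing parenthesis.
            return pos + 1
        else:
            # A run of non-parentheses, consumed without backtracking.
            while pos < len(text) and text[pos] not in "()":
                pos += 1
    # Ran out of text before the group was closed.
    return None


if __name__ == "__main__":
    print(match_balanced("(a(b)c)"))
    print(match_balanced("(a(b)c"))
```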
1518  .P  .P
1519  If this were part of a larger pattern, you would not want to recurse the entire  If this were part of a larger pattern, you would not want to recurse the entire
# Line 1435  matches "sense and sensibility" and "res Line 1595  matches "sense and sensibility" and "res
1595    (sens|respons)e and (?1)ibility    (sens|respons)e and (?1)ibility
1596  .sp  .sp
1597  is used, it does match "sense and responsibility" as well as the other two  is used, it does match "sense and responsibility" as well as the other two
1598  strings. Such references must, however, follow the subpattern to which they  strings. Such references, if given numerically, must follow the subpattern to
1599  refer.  which they refer. However, named references can refer to later subpatterns.
1600    .P
1601    Like recursive subpatterns, a "subroutine" call is always treated as an atomic
1602    group. That is, once it has matched some of the subject string, it is never
1603    re-entered, even if it contains untried alternatives and there is a subsequent
1604    matching failure.
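
The difference between a back reference and a "subroutine" call in the "sense and sensibility" example can be approximated in Python's re module, which has no (?1) syntax. In this sketch the call is emulated by repeating the subpattern text in a non-capturing group; that reproduces which strings match, but not the atomic-group behaviour described above. The names STEM, BACKREF and SUBROUTINE_LIKE are illustrative.

```python
import re

STEM = r"sens|respons"

# \1 must repeat the exact text captured by the first group.
BACKREF = re.compile(rf"({STEM})e and \1ibility")
# Repeating the subpattern lets the second part match independently.
SUBROUTINE_LIKE = re.compile(rf"({STEM})e and (?:{STEM})ibility")


def matches(pattern, text):
    """Return True if the whole of text matches the compiled pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("sense and sensibility", "sense and responsibility"):
        print(sample, matches(BACKREF, sample), matches(SUBROUTINE_LIKE, sample))
```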
1605  .  .
1606  .  .
1607  .SH CALLOUTS  .SH CALLOUTS
# Line 1475  description of the interface to the call Line 1640  description of the interface to the call
1640  documentation.  documentation.
1641  .P  .P
1642  .in 0  .in 0
1643  Last updated: 28 February 2005  Last updated: 06 June 2006
1644  .br  .br
1645  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
