Diff of /code/trunk/doc/html/pcrepattern.html

revision 77 by nigel, Sat Feb 24 21:40:45 2007 UTC revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC
# Line 123  The following sections describe the use Line 123  The following sections describe the use
123  <br><a name="SEC2" href="#TOC1">BACKSLASH</a><br>  <br><a name="SEC2" href="#TOC1">BACKSLASH</a><br>
124  <P>  <P>
125  The backslash character has several uses. Firstly, if it is followed by a  The backslash character has several uses. Firstly, if it is followed by a
126  non-alphanumeric character, it takes away any special meaning that character may  non-alphanumeric character, it takes away any special meaning that character
127  have. This use of backslash as an escape character applies both inside and  may have. This use of backslash as an escape character applies both inside and
128  outside character classes.  outside character classes.
129  </P>  </P>
130  <P>  <P>
# Line 137  particular, if you want to match a backs Line 137  particular, if you want to match a backs
137  <P>  <P>
138  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
139  pattern (other than in a character class) and characters between a # outside  pattern (other than in a character class) and characters between a # outside
140  a character class and the next newline character are ignored. An escaping  a character class and the next newline are ignored. An escaping backslash can
141  backslash can be used to include a whitespace or # character as part of the  be used to include a whitespace or # character as part of the pattern.
 pattern.  
142  </P>  </P>
143  <P>  <P>
144  If you want to remove the special meaning from a sequence of characters, you  If you want to remove the special meaning from a sequence of characters, you
# Line 175  represents: Line 174  represents:
174    \t        tab (hex 09)    \t        tab (hex 09)
175    \ddd      character with octal code ddd, or backreference    \ddd      character with octal code ddd, or backreference
176    \xhh      character with hex code hh    \xhh      character with hex code hh
177    \x{hhh..} character with hex code hhh... (UTF-8 mode only)    \x{hhh..} character with hex code hhh..
178  </pre>  </pre>
179  The precise effect of \cx is as follows: if x is a lower case letter, it  The precise effect of \cx is as follows: if x is a lower case letter, it
180  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 184  Thus \cz becomes hex 1A, but \c{ becomes Line 183  Thus \cz becomes hex 1A, but \c{ becomes
183  </P>  </P>
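The \cx rule described above is plain bit arithmetic, so it can be checked by hand. The following is a minimal Python sketch of that rule only; the helper name control_escape_value is hypothetical and is not part of PCRE.

```python
def control_escape_value(x: str) -> int:
    # Hypothetical helper: returns the character code that the escape
    # "backslash c x" denotes under the rule described in the text.
    if "a" <= x <= "z":
        x = x.upper()      # a lower case letter is converted to upper case first
    return ord(x) ^ 0x40   # then bit 6 of the character (hex 40) is inverted


print(hex(control_escape_value("z")))  # 0x1a
print(hex(control_escape_value("{")))  # 0x3b
```

Because only lower case letters are folded to upper case, "{" (hex 7B) is not adjusted before the bit is inverted, which is why it yields hex 3B.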
184  <P>  <P>
185  After \x, from zero to two hexadecimal digits are read (letters can be in  After \x, from zero to two hexadecimal digits are read (letters can be in
186  upper or lower case). In UTF-8 mode, any number of hexadecimal digits may  upper or lower case). Any number of hexadecimal digits may appear between \x{
187  appear between \x{ and }, but the value of the character code must be less  and }, but the value of the character code must be less than 256 in non-UTF-8
188  than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters  mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value
189  other than hexadecimal digits appear between \x{ and }, or if there is no  is 7FFFFFFF). If characters other than hexadecimal digits appear between \x{
190  terminating }, this form of escape is not recognized. Instead, the initial  and }, or if there is no terminating }, this form of escape is not recognized.
191  \x will be interpreted as a basic hexadecimal escape, with no following  Instead, the initial \x will be interpreted as a basic hexadecimal escape,
192  digits, giving a character whose value is zero.  with no following digits, giving a character whose value is zero.
193  </P>  </P>
194  <P>  <P>
195  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
196  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference in the  syntaxes for \x. There is no difference in the way they are handled. For
197  way they are handled. For example, \xdc is exactly the same as \x{dc}.  example, \xdc is exactly the same as \x{dc}.
198  </P>  </P>
199  <P>  <P>
200  After \0 up to two further octal digits are read. In both cases, if there  After \0 up to two further octal digits are read. If there are fewer than two
201  are fewer than two digits, just those that are present are used. Thus the  digits, just those that are present are used. Thus the sequence \0\x\07
202  sequence \0\x\07 specifies two binary zeros followed by a BEL character  specifies two binary zeros followed by a BEL character (code value 7). Make
203  (code value 7). Make sure you supply two digits after the initial zero if the  sure you supply two digits after the initial zero if the pattern character that
204  pattern character that follows is itself an octal digit.  follows is itself an octal digit.
205  </P>  </P>
206  <P>  <P>
207  The handling of a backslash followed by a digit other than 0 is complicated.  The handling of a backslash followed by a digit other than 0 is complicated.
# Line 217  following the discussion of Line 216  following the discussion of
216  <P>  <P>
217  Inside a character class, or if the decimal number is greater than 9 and there  Inside a character class, or if the decimal number is greater than 9 and there
218  have not been that many capturing subpatterns, PCRE re-reads up to three octal  have not been that many capturing subpatterns, PCRE re-reads up to three octal
219  digits following the backslash, and generates a single byte from the least  digits following the backslash, and uses them to generate a data character. Any
220  significant 8 bits of the value. Any subsequent digits stand for themselves.  subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
221  For example:  character specified in octal must be less than \400. In UTF-8 mode, values up
222    to \777 are permitted. For example:
223  <pre>  <pre>
224    \040   is another way of writing a space    \040   is another way of writing a space
225    \40    is the same, provided there are fewer than 40 previous capturing subpatterns    \40    is the same, provided there are fewer than 40 previous capturing subpatterns
# Line 235  Note that octal values of 100 or greater Line 235  Note that octal values of 100 or greater
235  zero, because no more than three octal digits are ever read.  zero, because no more than three octal digits are ever read.
236  </P>  </P>
237  <P>  <P>
238  All the sequences that define a single byte value or a single UTF-8 character  All the sequences that define a single character value can be used both inside
239  (in UTF-8 mode) can be used both inside and outside character classes. In  and outside character classes. In addition, inside a character class, the
240  addition, inside a character class, the sequence \b is interpreted as the  sequence \b is interpreted as the backspace character (hex 08), and the
241  backspace character (hex 08), and the sequence \X is interpreted as the  sequence \X is interpreted as the character "X". Outside a character class,
242  character "X". Outside a character class, these sequences have different  these sequences have different meanings
 meanings  
243  <a href="#uniextseq">(see below).</a>  <a href="#uniextseq">(see below).</a>
244  </P>  </P>
245  <br><b>  <br><b>
# Line 269  there is no character to match. Line 268  there is no character to match.
268  <P>  <P>
269  For compatibility with Perl, \s does not match the VT character (code 11).  For compatibility with Perl, \s does not match the VT character (code 11).
270  This makes it different from the POSIX "space" class. The \s characters  This makes it different from the POSIX "space" class. The \s characters
271  are HT (9), LF (10), FF (12), CR (13), and space (32).  are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is
272    included in a Perl script, \s may match the VT character. In PCRE, it never
273    does.)
274  </P>  </P>
275  <P>  <P>
276  A "word" character is an underscore or any character less than 256 that is a  A "word" character is an underscore or any character less than 256 that is a
# Line 285  greater than 128 are used for accented l Line 286  greater than 128 are used for accented l
286  <P>  <P>
287  In UTF-8 mode, characters with values greater than 128 never match \d, \s, or  In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
288  \w, and always match \D, \S, and \W. This is true even when Unicode  \w, and always match \D, \S, and \W. This is true even when Unicode
289  character property support is available.  character property support is available. The use of locales with Unicode is
290    discouraged.
291  <a name="uniextseq"></a></P>  <a name="uniextseq"></a></P>
292  <br><b>  <br><b>
293  Unicode character properties  Unicode character properties
294  </b><br>  </b><br>
295  <P>  <P>
296  When PCRE is built with Unicode character property support, three additional  When PCRE is built with Unicode character property support, three additional
297  escape sequences to match generic character types are available when UTF-8 mode  escape sequences to match character properties are available when UTF-8 mode
298  is selected. They are:  is selected. They are:
299  <pre>  <pre>
300   \p{<i>xx</i>}   a character with the <i>xx</i> property    \p{<i>xx</i>}   a character with the <i>xx</i> property
301   \P{<i>xx</i>}   a character without the <i>xx</i> property    \P{<i>xx</i>}   a character without the <i>xx</i> property
302   \X       an extended Unicode sequence    \X       an extended Unicode sequence
303  </pre>  </pre>
304  The property names represented by <i>xx</i> above are limited to the  The property names represented by <i>xx</i> above are limited to the Unicode
305  Unicode general category properties. Each character has exactly one such  script names, the general category properties, and "Any", which matches any
306  property, specified by a two-letter abbreviation. For compatibility with Perl,  character (including newline). Other properties such as "InMusicalSymbols" are
307  negation can be specified by including a circumflex between the opening brace  not currently supported by PCRE. Note that \P{Any} does not match any
308  and the property name. For example, \p{^Lu} is the same as \P{Lu}.  characters, so always causes a match failure.
309  </P>  </P>
310  <P>  <P>
311  If only one letter is specified with \p or \P, it includes all the properties  Sets of Unicode characters are defined as belonging to certain scripts. A
312  that start with that letter. In this case, in the absence of negation, the  character from one of these sets can be matched using a script name. For
313  curly brackets in the escape sequence are optional; these two examples have  example:
314  the same effect:  <pre>
315      \p{Greek}
316      \P{Han}
317    </pre>
318    Those that are not part of an identified script are lumped together as
319    "Common". The current list of scripts is:
320    </P>
321    <P>
322    Arabic,
323    Armenian,
324    Bengali,
325    Bopomofo,
326    Braille,
327    Buginese,
328    Buhid,
329    Canadian_Aboriginal,
330    Cherokee,
331    Common,
332    Coptic,
333    Cypriot,
334    Cyrillic,
335    Deseret,
336    Devanagari,
337    Ethiopic,
338    Georgian,
339    Glagolitic,
340    Gothic,
341    Greek,
342    Gujarati,
343    Gurmukhi,
344    Han,
345    Hangul,
346    Hanunoo,
347    Hebrew,
348    Hiragana,
349    Inherited,
350    Kannada,
351    Katakana,
352    Kharoshthi,
353    Khmer,
354    Lao,
355    Latin,
356    Limbu,
357    Linear_B,
358    Malayalam,
359    Mongolian,
360    Myanmar,
361    New_Tai_Lue,
362    Ogham,
363    Old_Italic,
364    Old_Persian,
365    Oriya,
366    Osmanya,
367    Runic,
368    Shavian,
369    Sinhala,
370    Syloti_Nagri,
371    Syriac,
372    Tagalog,
373    Tagbanwa,
374    Tai_Le,
375    Tamil,
376    Telugu,
377    Thaana,
378    Thai,
379    Tibetan,
380    Tifinagh,
381    Ugaritic,
382    Yi.
383    </P>
384    <P>
385    Each character has exactly one general category property, specified by a
386    two-letter abbreviation. For compatibility with Perl, negation can be specified
387    by including a circumflex between the opening brace and the property name. For
388    example, \p{^Lu} is the same as \P{Lu}.
389    </P>
390    <P>
391    If only one letter is specified with \p or \P, it includes all the general
392    category properties that start with that letter. In this case, in the absence
393    of negation, the curly brackets in the escape sequence are optional; these two
394    examples have the same effect:
395  <pre>  <pre>
396    \p{L}    \p{L}
397    \pL    \pL
398  </pre>  </pre>
399  The following property codes are supported:  The following general category property codes are supported:
400  <pre>  <pre>
401    C     Other    C     Other
402    Cc    Control    Cc    Control
# Line 360  The following property codes are support Line 442  The following property codes are support
442    Zp    Paragraph separator    Zp    Paragraph separator
443    Zs    Space separator    Zs    Space separator
444  </pre>  </pre>
445  Extended properties such as "Greek" or "InMusicalSymbols" are not supported by  The special property L& is also supported: it matches a character that has
446  PCRE.  the Lu, Ll, or Lt property, in other words, a letter that is not classified as
447    a modifier or "other".
448    </P>
449    <P>
450    The long synonyms for these properties that Perl supports (such as \p{Letter})
451    are not supported by PCRE, nor is it permitted to prefix any of these
452    properties with "Is".
453    </P>
454    <P>
455    No character that is in the Unicode table has the Cn (unassigned) property.
456    Instead, this property is assumed for any code point that is not in the
457    Unicode table.
458  </P>  </P>
459  <P>  <P>
460  Specifying caseless matching does not affect these escape sequences. For  Specifying caseless matching does not affect these escape sequences. For
# Line 395  specifies a condition that has to be met Line 488  specifies a condition that has to be met
488  without consuming any characters from the subject string. The use of  without consuming any characters from the subject string. The use of
489  subpatterns for more complicated assertions is described  subpatterns for more complicated assertions is described
490  <a href="#bigassertions">below.</a>  <a href="#bigassertions">below.</a>
491  The backslashed  The backslashed assertions are:
 assertions are:  
492  <pre>  <pre>
493    \b     matches at a word boundary    \b     matches at a word boundary
494    \B     matches when not at a word boundary    \B     matches when not at a word boundary
# Line 423  PCRE_NOTBOL or PCRE_NOTEOL options, whic Line 515  PCRE_NOTBOL or PCRE_NOTEOL options, whic
515  circumflex and dollar metacharacters. However, if the <i>startoffset</i>  circumflex and dollar metacharacters. However, if the <i>startoffset</i>
516  argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start  argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start
517  at a point other than the beginning of the subject, \A can never match. The  at a point other than the beginning of the subject, \A can never match. The
518  difference between \Z and \z is that \Z matches before a newline that is the  difference between \Z and \z is that \Z matches before a newline at the end
519  last character of the string as well as at the end of the string, whereas \z  of the string as well as at the very end, whereas \z matches only at the end.
 matches only at the end.  
520  </P>  </P>
521  <P>  <P>
522  The \G assertion is true only when the current matching position is at the  The \G assertion is true only when the current matching position is at the
# Line 469  to be anchored.) Line 560  to be anchored.)
560  <P>  <P>
561  A dollar character is an assertion that is true only if the current matching  A dollar character is an assertion that is true only if the current matching
562  point is at the end of the subject string, or immediately before a newline  point is at the end of the subject string, or immediately before a newline
563  character that is the last character in the string (by default). Dollar need  at the end of the string (by default). Dollar need not be the last character of
564  not be the last character of the pattern if a number of alternatives are  the pattern if a number of alternatives are involved, but it should be the last
565  involved, but it should be the last item in any branch in which it appears.  item in any branch in which it appears. Dollar has no special meaning in a
566  Dollar has no special meaning in a character class.  character class.
567  </P>  </P>
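The default behaviour of dollar can be tried out quickly. The sketch below uses Python's re module rather than PCRE itself. This is an assumption of convenience: for a subject that uses a single LF as newline, Python's default treatment of $ agrees with the description above, but it is not the PCRE API.

```python
import re

# Dollar is true at the very end of the subject string ...
at_end = re.search(r"abc$", "abc")

# ... and immediately before a newline that is the last character.
before_final_newline = re.search(r"abc$", "abc\n")

# Without a multiline option it is not true before an internal newline.
before_inner_newline = re.search(r"abc$", "abc\ndef")

print(at_end is not None)                # True
print(before_final_newline is not None)  # True
print(before_inner_newline is not None)  # False
```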
568  <P>  <P>
569  The meaning of dollar can be changed so that it matches only at the very end of  The meaning of dollar can be changed so that it matches only at the very end of
# Line 481  does not affect the \Z assertion. Line 572  does not affect the \Z assertion.
572  </P>  </P>
573  <P>  <P>
574  The meanings of the circumflex and dollar characters are changed if the  The meanings of the circumflex and dollar characters are changed if the
575  PCRE_MULTILINE option is set. When this is the case, they match immediately  PCRE_MULTILINE option is set. When this is the case, a circumflex matches
576  after and immediately before an internal newline character, respectively, in  immediately after internal newlines as well as at the start of the subject
577  addition to matching at the start and end of the subject string. For example,  string. It does not match after a newline that ends the string. A dollar
578  the pattern /^abc$/ matches the subject string "def\nabc" (where \n  matches before any newlines in the string, as well as at the very end, when
579  represents a newline character) in multiline mode, but not otherwise.  PCRE_MULTILINE is set. When newline is specified as the two-character
580  Consequently, patterns that are anchored in single line mode because all  sequence CRLF, isolated CR and LF characters do not indicate newlines.
581  branches start with ^ are not anchored in multiline mode, and a match for  </P>
582  circumflex is possible when the <i>startoffset</i> argument of <b>pcre_exec()</b>  <P>
583  is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is  For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
584  set.  \n represents a newline) in multiline mode, but not otherwise. Consequently,
585    patterns that are anchored in single line mode because all branches start with
586    ^ are not anchored in multiline mode, and a match for circumflex is possible
587    when the <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero. The
588    PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
589  </P>  </P>
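The /^abc$/ example above can be reproduced as follows. This is a sketch in Python's re module, whose re.MULTILINE flag is assumed here to stand in for PCRE_MULTILINE; it handles this LF-only subject the same way, but it is not the PCRE API and has no equivalent of the CRLF newline setting.

```python
import re

subject = "def\nabc"

# Single line mode: circumflex matches only at the start of the subject,
# so "abc" on the second line is not found.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

print(single_line)        # None
print(multi_line.span())  # (4, 7)
```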
590  <P>  <P>
591  Note that the sequences \A, \Z, and \z can be used to match the start and  Note that the sequences \A, \Z, and \z can be used to match the start and
592  end of the subject in both modes, and if all branches of a pattern start with  end of the subject in both modes, and if all branches of a pattern start with
593  \A it is always anchored, whether PCRE_MULTILINE is set or not.  \A it is always anchored, whether or not PCRE_MULTILINE is set.
594  </P>  </P>
595  <br><a name="SEC4" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>  <br><a name="SEC4" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
596  <P>  <P>
597  Outside a character class, a dot in the pattern matches any one character in  Outside a character class, a dot in the pattern matches any one character in
598  the subject, including a non-printing character, but not (by default) newline.  the subject string except (by default) a character that signifies the end of a
599  In UTF-8 mode, a dot matches any UTF-8 character, which might be more than one  line. In UTF-8 mode, the matched character may be more than one byte long. When
600  byte long, except (by default) newline. If the PCRE_DOTALL option is set,  a line ending is defined as a single character (CR or LF), dot never matches
601  dots match newlines as well. The handling of dot is entirely independent of the  that character; when the two-character sequence CRLF is used, dot does not
602  handling of circumflex and dollar, the only relationship being that they both  match CR if it is immediately followed by LF, but otherwise it matches all
603  involve newline characters. Dot has no special meaning in a character class.  characters (including isolated CRs and LFs).
604    </P>
605    <P>
606    The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
607    option is set, a dot matches any one character, without exception. If newline
608    is defined as the two-character sequence CRLF, it takes two dots to match it.
609    </P>
610    <P>
611    The handling of dot is entirely independent of the handling of circumflex and
612    dollar, the only relationship being that they both involve newlines. Dot has no
613    special meaning in a character class.
614  </P>  </P>
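The effect of the dot-matches-all setting can be seen in a short experiment. The sketch below uses Python's re module, with re.DOTALL assumed to play the role of PCRE_DOTALL. Python always treats LF as the line ending, so the CRLF rules described above cannot be shown this way.

```python
import re

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r"a.b", "a\nb")

# With the dot-matches-all option, a dot matches any one character.
dotall_dot = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# A negated character class is unaffected by the option: it matches newline anyway.
negated_class = re.fullmatch(r"a[^x]b", "a\nb")

print(default_dot is not None)    # False
print(dotall_dot is not None)     # True
print(negated_class is not None)  # True
```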
615  <br><a name="SEC5" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC5" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
616  <P>  <P>
617  Outside a character class, the escape sequence \C matches any one byte, both  Outside a character class, the escape sequence \C matches any one byte, both
618  in and out of UTF-8 mode. Unlike a dot, it can match a newline. The feature is  in and out of UTF-8 mode. Unlike a dot, it always matches CR and LF. The
619  provided in Perl in order to match individual bytes in UTF-8 mode. Because it  feature is provided in Perl in order to match individual bytes in UTF-8 mode.
620  breaks up UTF-8 characters into individual bytes, what remains in the string  Because it breaks up UTF-8 characters into individual bytes, what remains in
621  may be a malformed UTF-8 string. For this reason, the \C escape sequence is  the string may be a malformed UTF-8 string. For this reason, the \C escape
622  best avoided.  sequence is best avoided.
623  </P>  </P>
624  <P>  <P>
625  PCRE does not allow \C to appear in lookbehind assertions  PCRE does not allow \C to appear in lookbehind assertions
# Line 565  ensure that PCRE is compiled with Unicod Line 670  ensure that PCRE is compiled with Unicod
670  UTF-8 support.  UTF-8 support.
671  </P>  </P>
672  <P>  <P>
673  The newline character is never treated in any special way in character classes,  Characters that might indicate line breaks (CR and LF) are never treated in any
674  whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class  special way when matching character classes, whatever line-ending sequence is
675  such as [^a] will always match a newline.  in use, and whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is
676    used. A class such as [^a] always matches one of these characters.
677  </P>  </P>
678  <P>  <P>
679  The minus (hyphen) character can be used to specify a range of characters in a  The minus (hyphen) character can be used to specify a range of characters in a
# Line 670  the pattern Line 776  the pattern
776    gilbert|sullivan    gilbert|sullivan
777  </pre>  </pre>
778  matches either "gilbert" or "sullivan". Any number of alternatives may appear,  matches either "gilbert" or "sullivan". Any number of alternatives may appear,
779  and an empty alternative is permitted (matching the empty string).  and an empty alternative is permitted (matching the empty string). The matching
780  The matching process tries each alternative in turn, from left to right,  process tries each alternative in turn, from left to right, and the first one
781  and the first one that succeeds is used. If the alternatives are within a  that succeeds is used. If the alternatives are within a subpattern
 subpattern  
782  <a href="#subpattern">(defined below),</a>  <a href="#subpattern">(defined below),</a>
783  "succeeds" means matching the rest of the main pattern as well as the  "succeeds" means matching the rest of the main pattern as well as the
784  alternative in the subpattern.  alternative in the subpattern.
# Line 722  option settings happen at compile time. Line 827  option settings happen at compile time.
827  behaviour otherwise.  behaviour otherwise.
828  </P>  </P>
829  <P>  <P>
830  The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the  The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
831  same way as the Perl-compatible options by using the characters U and X  changed in the same way as the Perl-compatible options by using the characters
832  respectively. The (?X) flag setting is special in that it must always occur  J, U and X respectively.
 earlier in the pattern than any of the additional features it turns on, even  
 when it is at top level. It is best to put it at the start.  
833  <a name="subpattern"></a></P>  <a name="subpattern"></a></P>
834  <br><a name="SEC10" href="#TOC1">SUBPATTERNS</a><br>  <br><a name="SEC10" href="#TOC1">SUBPATTERNS</a><br>
835  <P>  <P>
# Line 789  Identifying capturing parentheses by num Line 892  Identifying capturing parentheses by num
892  to keep track of the numbers in complicated regular expressions. Furthermore,  to keep track of the numbers in complicated regular expressions. Furthermore,
893  if an expression is modified, the numbers may change. To help with this  if an expression is modified, the numbers may change. To help with this
894  difficulty, PCRE supports the naming of subpatterns, something that Perl does  difficulty, PCRE supports the naming of subpatterns, something that Perl does
895  not provide. The Python syntax (?P&#60;name&#62;...) is used. Names consist of  not provide. The Python syntax (?P&#60;name&#62;...) is used. References to capturing
896  alphanumeric characters and underscores, and must be unique within a pattern.  parentheses from other parts of the pattern, such as
897  </P>  <a href="#backreferences">backreferences,</a>
898  <P>  <a href="#recursion">recursion,</a>
899  Named capturing parentheses are still allocated numbers as well as names. The  and
900  PCRE API provides function calls for extracting the name-to-number translation  <a href="#conditions">conditions,</a>
901  table from a compiled pattern. There is also a convenience function for  can be made by name as well as by number.
902  extracting a captured substring by name. For further details see the  </P>
903    <P>
904    Names consist of up to 32 alphanumeric characters and underscores. Named
905    capturing parentheses are still allocated numbers as well as names. The PCRE
906    API provides function calls for extracting the name-to-number translation table
907    from a compiled pattern. There is also a convenience function for extracting a
908    captured substring by name.
909    </P>
910    <P>
911    By default, a name must be unique within a pattern, but it is possible to relax
912    this constraint by setting the PCRE_DUPNAMES option at compile time. This can
913    be useful for patterns where only one instance of the named parentheses can
914    match. Suppose you want to match the name of a weekday, either as a 3-letter
915    abbreviation or as the full name, and in both cases you want to extract the
916    abbreviation. This pattern (ignoring the line breaks) does the job:
917    <pre>
918      (?P&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
919      (?P&#60;DN&#62;Tue)(?:sday)?|
920      (?P&#60;DN&#62;Wed)(?:nesday)?|
921      (?P&#60;DN&#62;Thu)(?:rsday)?|
922      (?P&#60;DN&#62;Sat)(?:urday)?
923    </pre>
924    There are five capturing substrings, but only one is ever set after a match.
925    The convenience function for extracting the data by name returns the substring
926    for the first, and in this example, the only, subpattern of that name that
927    matched. This saves searching to find which numbered subpattern it was. If you
928    make a reference to a non-unique named subpattern from elsewhere in the
929    pattern, the one that corresponds to the lowest number is used. For further
930    details of the interfaces for handling named subpatterns, see the
931  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
932  documentation.  documentation.
933  </P>  </P>
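The point that named parentheses still receive numbers can be illustrated with Python's re module, which uses the same (?P&#60;name&#62;...) syntax (written with literal angle brackets in the code). This is a sketch only: groupindex is Python's name-to-number table, not the PCRE API function, and the date pattern is a made-up example. Python's re rejects duplicate names, so the PCRE_DUPNAMES weekday pattern above cannot be reproduced with it.

```python
import re

# Two named capturing groups; each is also allocated a number, left to right.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# The name-to-number translation table of the compiled pattern.
name_to_number = dict(pattern.groupindex)

match = pattern.fullmatch("2007-02")

print(name_to_number)        # {'year': 1, 'month': 2}
print(match.group("month"))  # 02
print(match.group(2))        # 02
```

Extracting by name and extracting by number return the same substring, which is what makes the name a convenience rather than a separate mechanism.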
# Line 1010  atomic group. However, there is no diffe Line 1141  atomic group. However, there is no diffe
1141  possessive quantifier and the equivalent atomic group.  possessive quantifier and the equivalent atomic group.
1142  </P>  </P>
1143  <P>  <P>
1144  The possessive quantifier syntax is an extension to the Perl syntax. It  The possessive quantifier syntax is an extension to the Perl syntax. Jeffrey
1145  originates in Sun's Java package.  Friedl originated the idea (and the name) in the first edition of his book.
1146    Mike McCloskey liked it, so implemented it when he built Sun's Java package,
1147    and PCRE copied it from there.
1148  </P>  </P>
1149  <P>  <P>
1150  When a pattern contains an unlimited repeat inside a subpattern that can itself  When a pattern contains an unlimited repeat inside a subpattern that can itself
# Line 1052  However, if the decimal number following Line 1185  However, if the decimal number following
1185  always taken as a back reference, and causes an error only if there are not  always taken as a back reference, and causes an error only if there are not
1186  that many capturing left parentheses in the entire pattern. In other words, the  that many capturing left parentheses in the entire pattern. In other words, the
1187  parentheses that are referenced need not be to the left of the reference for  parentheses that are referenced need not be to the left of the reference for
1188  numbers less than 10. See the subsection entitled "Non-printing characters"  numbers less than 10. A "forward back reference" of this type can make sense
1189    when a repetition is involved and the subpattern to the right has participated
1190    in an earlier iteration.
1191    </P>
1192    <P>
1193    It is not possible to have a numerical "forward back reference" to a subpattern
1194    whose number is 10 or more. However, a back reference to any subpattern is
1195    possible using named parentheses (see below). See also the subsection entitled
1196    "Non-printing characters"
1197  <a href="#digitsafterbackslash">above</a>  <a href="#digitsafterbackslash">above</a>
1198  for further details of the handling of digits following a backslash.  for further details of the handling of digits following a backslash.
1199  </P>  </P>
# Line 1078  capturing subpattern is matched caseless Line 1219  capturing subpattern is matched caseless
1219  Back references to named subpatterns use the Python syntax (?P=name). We could  Back references to named subpatterns use the Python syntax (?P=name). We could
1220  rewrite the above example as follows:  rewrite the above example as follows:
1221  <pre>  <pre>
1222    (?&#60;p1&#62;(?i)rah)\s+(?P=p1)    (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
1223  </pre>  </pre>
1224    A subpattern that is referenced by name may appear in the pattern before or
1225    after the reference.
1226    </P>
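A named back reference of this form can be tried in Python's re module, which shares the (?P=name) syntax. The sketch below makes one labelled change: recent Python versions do not accept a bare (?i) in the middle of a pattern, so the scoped form (?i:rah) is assumed here as a stand-in. The behaviour shown is Python's, not necessarily PCRE's in every detail.

```python
import re

# The group matches "rah" caselessly; the back reference is outside the
# scoped flag, so it must repeat the captured text exactly.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

same_case = pattern.fullmatch("RAH RAH")
mixed_case = pattern.fullmatch("rah RAH")

print(same_case is not None)   # True
print(mixed_case is not None)  # False
```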
1227    <P>
1228  There may be more than one back reference to the same subpattern. If a  There may be more than one back reference to the same subpattern. If a
1229  subpattern has not actually been used in a particular match, any back  subpattern has not actually been used in a particular match, any back
1230  references to it always fail. For example, the pattern  references to it always fail. For example, the pattern
# Line 1135  because it does not make sense for negat Line 1280  because it does not make sense for negat
1280  Lookahead assertions  Lookahead assertions
1281  </b><br>  </b><br>
1282  <P>  <P>
1283  Lookahead assertions start  Lookahead assertions start with (?= for positive assertions and (?! for
1284  with (?= for positive assertions and (?! for negative assertions. For example,  negative assertions. For example,
1285  <pre>  <pre>
1286    \w+(?=;)    \w+(?=;)
1287  </pre>  </pre>
# Line 1171  negative assertions. For example, Line 1316  negative assertions. For example,
1316  </pre>  </pre>
1317  does find an occurrence of "bar" that is not preceded by "foo". The contents of  does find an occurrence of "bar" that is not preceded by "foo". The contents of
1318  a lookbehind assertion are restricted such that all the strings it matches must  a lookbehind assertion are restricted such that all the strings it matches must
1319  have a fixed length. However, if there are several alternatives, they do not  have a fixed length. However, if there are several top-level alternatives, they
1320  all have to have the same fixed length. Thus  do not all have to have the same fixed length. Thus
1321  <pre>  <pre>
1322    (?&#60;=bullock|donkey)    (?&#60;=bullock|donkey)
1323  </pre>  </pre>
# Line 1267  preceded by "foo", while Line 1412  preceded by "foo", while
1412  </pre>  </pre>
1413  is another pattern that matches "foo" preceded by three digits and any three  is another pattern that matches "foo" preceded by three digits and any three
1414  characters that are not "999".  characters that are not "999".
1415  </P>  <a name="conditions"></a></P>
1416  <br><a name="SEC16" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>  <br><a name="SEC16" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
1417  <P>  <P>
1418  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
# Line 1284  subpattern, a compile-time error occurs. Line 1429  subpattern, a compile-time error occurs.
1429  </P>  </P>
1430  <P>  <P>
1431  There are three kinds of condition. If the text between the parentheses  There are three kinds of condition. If the text between the parentheses
1432  consists of a sequence of digits, the condition is satisfied if the capturing  consists of a sequence of digits, or a sequence of alphanumeric characters and
1433  subpattern of that number has previously matched. The number must be greater  underscores, the condition is satisfied if the capturing subpattern of that
1434  than zero. Consider the following pattern, which contains non-significant white  number or name has previously matched. There is a possible ambiguity here,
1435  space to make it more readable (assume the PCRE_EXTENDED option) and to divide  because subpattern names may consist entirely of digits. PCRE looks first for a
1436  it into three parts for ease of discussion:  named subpattern; if it cannot find one and the text consists entirely of
1437    digits, it looks for a subpattern of that number, which must be greater than
1438    zero. Using subpattern names that consist entirely of digits is not
1439    recommended.
1440    </P>
1441    <P>
1442    Consider the following pattern, which contains non-significant white space to
1443    make it more readable (assume the PCRE_EXTENDED option) and to divide it into
1444    three parts for ease of discussion:
1445  <pre>  <pre>
1446    ( \( )?    [^()]+    (?(1) \) )    ( \( )?    [^()]+    (?(1) \) )
1447  </pre>  </pre>
# Line 1300  or not. If they did, that is, if subject Line 1453  or not. If they did, that is, if subject
1453  the condition is true, and so the yes-pattern is executed and a closing  the condition is true, and so the yes-pattern is executed and a closing
1454  parenthesis is required. Otherwise, since no-pattern is not present, the  parenthesis is required. Otherwise, since no-pattern is not present, the
1455  subpattern matches nothing. In other words, this pattern matches a sequence of  subpattern matches nothing. In other words, this pattern matches a sequence of
1456  non-parentheses, optionally enclosed in parentheses.  non-parentheses, optionally enclosed in parentheses. Rewriting it to use a
1457  </P>  named subpattern gives this:
1458  <P>  <pre>
1459  If the condition is the string (R), it is satisfied if a recursive call to the    (?P&#60;OPEN&#62; \( )?    [^()]+    (?(OPEN) \) )
1460  pattern or subpattern has been made. At "top level", the condition is false.  </pre>
1461  This is a PCRE extension. Recursive patterns are described in the next section.  If the condition is the string (R), and there is no subpattern with the name R,
1462    the condition is satisfied if a recursive call to the pattern or subpattern has
1463    been made. At "top level", the condition is false. This is a PCRE extension.
1464    Recursive patterns are described in the next section.
1465  </P>  </P>
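The optional-parentheses pattern discussed in this section, in both its numbered and named forms, can be exercised with a short script. The sketch below assumes Python's `re` module as a stand-in for PCRE (it accepts the same (?(1)...) and (?(name)...) condition syntax for this case), with `re.VERBOSE` playing the role of PCRE_EXTENDED. The names `NUMBERED`, `NAMED` and `accepts` are hypothetical and exist only for this example.

```python
import re

# Assumption: Python's re module stands in for PCRE, and re.VERBOSE
# stands in for PCRE_EXTENDED so that the white space is ignored.

# Condition tested by group number.
NUMBERED = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

# The same pattern with the condition tested by group name.
NAMED = re.compile(r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)


def accepts(pattern, text):
    """Return True if pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("abc", "(abc)", "(abc", "abc)"):
        print(repr(sample), accepts(NUMBERED, sample), accepts(NAMED, sample))
```

Both forms accept "abc" and "(abc)" and reject a string with only one of the two parentheses, since the closing parenthesis is demanded only when the opening one was captured.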
1466  <P>  <P>
1467  If the condition is not a sequence of digits or (R), it must be an assertion.  If the condition is not a sequence of digits or (R), it must be an assertion.
# Line 1331  that make up a comment play no part in t Line 1487  that make up a comment play no part in t
1487  </P>  </P>
1488  <P>  <P>
1489  If the PCRE_EXTENDED option is set, an unescaped # character outside a  If the PCRE_EXTENDED option is set, an unescaped # character outside a
1490  character class introduces a comment that continues up to the next newline  character class introduces a comment that continues to immediately after the
1491  character in the pattern.  next newline in the pattern.
1492  </P>  <a name="recursion"></a></P>
1493  <br><a name="SEC18" href="#TOC1">RECURSIVE PATTERNS</a><br>  <br><a name="SEC18" href="#TOC1">RECURSIVE PATTERNS</a><br>
1494  <P>  <P>
1495  Consider the problem of matching a string in parentheses, allowing for  Consider the problem of matching a string in parentheses, allowing for
# Line 1360  number, provided that it occurs inside t Line 1516  number, provided that it occurs inside t
1516  (?R) is a recursive call of the entire regular expression.  (?R) is a recursive call of the entire regular expression.
1517  </P>  </P>
1518  <P>  <P>
1519  For example, this PCRE pattern solves the nested parentheses problem (assume  A recursive subpattern call is always treated as an atomic group. That is, once
1520  the PCRE_EXTENDED option is set so that white space is ignored):  it has matched some of the subject string, it is never re-entered, even if
1521    it contains untried alternatives and there is a subsequent matching failure.
1522    </P>
1523    <P>
1524    This PCRE pattern solves the nested parentheses problem (assume the
1525    PCRE_EXTENDED option is set so that white space is ignored):
1526  <pre>  <pre>
1527    \( ( (?&#62;[^()]+) | (?R) )* \)    \( ( (?&#62;[^()]+) | (?R) )* \)
1528  </pre>  </pre>
1529  First it matches an opening parenthesis. Then it matches any number of  First it matches an opening parenthesis. Then it matches any number of
1530  substrings which can either be a sequence of non-parentheses, or a recursive  substrings which can either be a sequence of non-parentheses, or a recursive
1531  match of the pattern itself (that is a correctly parenthesized substring).  match of the pattern itself (that is, a correctly parenthesized substring).
1532  Finally there is a closing parenthesis.  Finally there is a closing parenthesis.
1533  </P>  </P>
1534  <P>  <P>
# Line 1447  matches "sense and sensibility" and "res Line 1608  matches "sense and sensibility" and "res
1608    (sens|respons)e and (?1)ibility    (sens|respons)e and (?1)ibility
1609  </pre>  </pre>
1610  is used, it does match "sense and responsibility" as well as the other two  is used, it does match "sense and responsibility" as well as the other two
1611  strings. Such references must, however, follow the subpattern to which they  strings. Such references, if given numerically, must follow the subpattern to
1612  refer.  which they refer. However, named references can refer to later subpatterns.
1613    </P>
1614    <P>
1615    Like recursive subpatterns, a "subroutine" call is always treated as an atomic
1616    group. That is, once it has matched some of the subject string, it is never
1617    re-entered, even if it contains untried alternatives and there is a subsequent
1618    matching failure.
1619  </P>  </P>
1620  <br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>  <br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>
1621  <P>  <P>
# Line 1486  description of the interface to the call Line 1653  description of the interface to the call
1653  documentation.  documentation.
1654  </P>  </P>
1655  <P>  <P>
1656  Last updated: 28 February 2005  Last updated: 06 June 2006
1657  <br>  <br>
1658  Copyright &copy; 1997-2005 University of Cambridge.  Copyright &copy; 1997-2006 University of Cambridge.
1659  <p>  <p>
1660  Return to the <a href="index.html">PCRE index page</a>.  Return to the <a href="index.html">PCRE index page</a>.
1661  </p>  </p>

