
Diff of /code/trunk/doc/pcre.txt


revision 461 by ph10, Mon Oct 5 10:59:35 2009 UTC revision 469 by ph10, Mon Oct 19 14:38:48 2009 UTC
# Line 4904  RECURSIVE PATTERNS Line 4904  RECURSIVE PATTERNS
4904         so many different ways the + and * repeats can carve  up  the  subject,         so many different ways the + and * repeats can carve  up  the  subject,
4905         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
4906    
Removed from v.461:

       At the end of a match, the values set for any capturing subpatterns are
       those from the outermost level of the recursion at which the subpattern
       value  is  set.   If  you want to obtain intermediate values, a callout
       function can be used (see below and the pcrecallout documentation).  If
       the pattern above is matched against

         (ab(cd)ef)

       the  value  for  the  capturing  parentheses is "ef", which is the last
       value taken on at the top level. If additional parentheses  are  added,
       giving

         \( ( ( [^()]++ | (?R) )* ) \)
            ^                        ^
            ^                        ^

       the  string  they  capture is "ab(cd)ef", the contents of the top level
       parentheses. If there are more than 15 capturing parentheses in a  pat-
       tern, PCRE has to obtain extra memory to store data during a recursion,
       which it does by using pcre_malloc, freeing  it  via  pcre_free  after-
       wards.  If  no  memory  can  be  obtained,  the  match  fails  with the
       PCRE_ERROR_NOMEMORY error.

Added in v.469:

       At  the  end  of a match, the values of capturing parentheses are those
       from the outermost level. If you want to obtain intermediate values,  a
       callout  function can be used (see below and the pcrecallout documenta-
       tion). If the pattern above is matched against

         (ab(cd)ef)

       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
       which  is the last value taken on at the top level. If a capturing sub-
       pattern is not matched at the top level, its final value is unset, even
       if it is (temporarily) set at a deeper level.

       If  there are more than 15 capturing parentheses in a pattern, PCRE has
       to obtain extra memory to store data during a recursion, which it  does
       by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
       can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
4923    
4924         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do not confuse the (?R) item with the condition (R),  which  tests  for
4925         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.   Consider  this pattern, which matches text in angle brack-
# Line 5039  SUBPATTERNS AS SUBROUTINES Line 5033  SUBPATTERNS AS SUBROUTINES
5033         two strings. Another example is  given  in  the  discussion  of  DEFINE         two strings. Another example is  given  in  the  discussion  of  DEFINE
5034         above.         above.
5035    
5036         Like recursive subpatterns, a "subroutine" call is always treated as an         Like  recursive  subpatterns, a subroutine call is always treated as an
5037         atomic group. That is, once it has matched some of the subject  string,         atomic group. That is, once it has matched some of the subject  string,
5038         it  is  never  re-entered, even if it contains untried alternatives and         it  is  never  re-entered, even if it contains untried alternatives and
5039         there is a subsequent matching failure.         there is a subsequent matching failure. Any capturing parentheses  that
5040           are  set  during  the  subroutine  call revert to their previous values
5041           afterwards.
5042    
5043         When a subpattern is used as a subroutine, processing options  such  as         When a subpattern is used as a subroutine, processing options  such  as
5044         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
# Line 5125  BACKTRACKING CONTROL Line 5121  BACKTRACKING CONTROL
5121         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5122         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5123    
5124         If any of these verbs are used in an assertion subpattern, their effect         If any of these verbs are used in an assertion or subroutine subpattern
5125         is confined to that subpattern; it does not extend to  the  surrounding         (including recursive subpatterns), their effect  is  confined  to  that
5126         pattern.   Note that assertion subpatterns are processed as anchored at         subpattern;  it  does  not extend to the surrounding pattern. Note that
5127         the point where they are tested.         such subpatterns are processed as anchored at the point where they  are
5128           tested.
5129    
5130         The new verbs make use of what was previously invalid syntax: an  open-         The  new verbs make use of what was previously invalid syntax: an open-
5131         ing parenthesis followed by an asterisk. In Perl, they are generally of         ing parenthesis followed by an asterisk. In Perl, they are generally of
5132         the form (*VERB:ARG) but PCRE does not support the use of arguments, so         the form (*VERB:ARG) but PCRE does not support the use of arguments, so
5133         its  general  form is just (*VERB). Any number of these verbs may occur         its general form is just (*VERB). Any number of these verbs  may  occur
5134         in a pattern. There are two kinds:         in a pattern. There are two kinds:
5135    
5136     Verbs that act immediately     Verbs that act immediately
# Line 5142  BACKTRACKING CONTROL Line 5139  BACKTRACKING CONTROL
5139    
5140            (*ACCEPT)            (*ACCEPT)
5141    
5142         This verb causes the match to end successfully, skipping the  remainder         This  verb causes the match to end successfully, skipping the remainder
5143         of  the pattern. When inside a recursion, only the innermost pattern is         of the pattern. When inside a recursion, only the innermost pattern  is
5144         ended immediately. If (*ACCEPT) is inside  capturing  parentheses,  the         ended  immediately.  If  (*ACCEPT) is inside capturing parentheses, the
5145         data  so  far  is  captured. (This feature was added to PCRE at release         data so far is captured. (This feature was added  to  PCRE  at  release
5146         8.00.) For example:         8.00.) For example:
5147    
5148           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
5149    
5150         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
5151         tured by the outer parentheses.         tured by the outer parentheses.
5152    
5153           (*FAIL) or (*F)           (*FAIL) or (*F)
5154    
5155         This  verb  causes the match to fail, forcing backtracking to occur. It         This verb causes the match to fail, forcing backtracking to  occur.  It
5156         is equivalent to (?!) but easier to read. The Perl documentation  notes         is  equivalent to (?!) but easier to read. The Perl documentation notes
5157         that  it  is  probably  useful only when combined with (?{}) or (??{}).         that it is probably useful only when combined  with  (?{})  or  (??{}).
5158         Those are, of course, Perl features that are not present in  PCRE.  The         Those  are,  of course, Perl features that are not present in PCRE. The
5159         nearest  equivalent is the callout feature, as for example in this pat-         nearest equivalent is the callout feature, as for example in this  pat-
5160         tern:         tern:
5161    
5162           a+(?C)(*FAIL)           a+(?C)(*FAIL)
5163    
5164         A match with the string "aaaa" always fails, but the callout  is  taken         A  match  with the string "aaaa" always fails, but the callout is taken
5165         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
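
The figure of 10 callouts quoted above can be checked with a little arithmetic. The following Python sketch is not PCRE code; it is a hand-written model (the function name count_callouts is hypothetical) that assumes the matcher tries every starting position in turn and that a+ gives back one "a" per backtrack, reaching the callout once for each length it tries.

```python
def count_callouts(subject: str) -> int:
    """Model how often the callout in a+(?C)(*FAIL) would be reached.

    Assumption: no start-up optimizations; every start offset is tried.
    """
    calls = 0
    for start in range(len(subject)):      # "bumpalong" over start offsets
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then backs off one "a" at a time;
        # each length from `run` down to 1 passes the callout once.
        calls += run
    return calls


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```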
5166    
5167     Verbs that act after backtracking     Verbs that act after backtracking
5168    
5169         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
5170         tinues with what follows, but if there is no subsequent match, a  fail-         tinues  with what follows, but if there is no subsequent match, a fail-
5171         ure  is  forced.   The  verbs  differ  in  exactly what kind of failure         ure is forced.  The verbs  differ  in  exactly  what  kind  of  failure
5172         occurs.         occurs.
5173    
5174           (*COMMIT)           (*COMMIT)
5175    
5176         This verb causes the whole match to fail outright if the  rest  of  the         This  verb  causes  the whole match to fail outright if the rest of the
5177         pattern  does  not match. Even if the pattern is unanchored, no further         pattern does not match. Even if the pattern is unanchored,  no  further
5178         attempts to find a match by advancing the starting  point  take  place.         attempts  to  find  a match by advancing the starting point take place.
5179         Once  (*COMMIT)  has been passed, pcre_exec() is committed to finding a         Once (*COMMIT) has been passed, pcre_exec() is committed to  finding  a
5180         match at the current starting point, or not at all. For example:         match at the current starting point, or not at all. For example:
5181    
5182           a+(*COMMIT)b           a+(*COMMIT)b
5183    
5184         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
5185         of dynamic anchor, or "I've started, so I must finish."         of dynamic anchor, or "I've started, so I must finish."
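
The "dynamic anchor" behaviour can be illustrated with a small model. This Python sketch does not call PCRE; it hand-simulates the single pattern a+(*COMMIT)b under the assumption that backtracking onto (*COMMIT) abandons the whole match. The function name commit_match is hypothetical.

```python
def commit_match(subject: str):
    """Model a+(*COMMIT)b; return (start, end) of the match, or None."""
    for start in range(len(subject)):
        pos = start
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos == start:
            continue        # a+ failed before (*COMMIT) was reached: bump along
        # (*COMMIT) has now been passed. Giving back "a"s cannot help, since
        # that would mean backtracking across the verb, so "b" must match here.
        if pos < len(subject) and subject[pos] == "b":
            return (start, pos + 1)
        return None         # committed at this start point: no later attempts
    return None


print(commit_match("xxaab"))   # (2, 5)
print(commit_match("aacaab"))  # None
```

In the second call the attempt at offset 0 passes (*COMMIT) and then fails at "c", so the later "aab" is never tried.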
5186    
5187           (*PRUNE)           (*PRUNE)
5188    
5189         This  verb causes the match to fail at the current position if the rest         This verb causes the match to fail at the current position if the  rest
5190         of the pattern does not match. If the pattern is unanchored, the normal         of the pattern does not match. If the pattern is unanchored, the normal
5191         "bumpalong"  advance to the next starting character then happens. Back-         "bumpalong" advance to the next starting character then happens.  Back-
5192         tracking can occur as usual to the left of (*PRUNE), or  when  matching         tracking  can  occur as usual to the left of (*PRUNE), or when matching
5193         to  the right of (*PRUNE), but if there is no match to the right, back-         to the right of (*PRUNE), but if there is no match to the right,  back-
5194         tracking cannot cross (*PRUNE).  In simple cases, the use  of  (*PRUNE)         tracking  cannot  cross (*PRUNE).  In simple cases, the use of (*PRUNE)
5195         is just an alternative to an atomic group or possessive quantifier, but         is just an alternative to an atomic group or possessive quantifier, but
5196         there are some uses of (*PRUNE) that cannot be expressed in  any  other         there  are  some uses of (*PRUNE) that cannot be expressed in any other
5197         way.         way.
5198    
5199           (*SKIP)           (*SKIP)
5200    
5201         This  verb  is like (*PRUNE), except that if the pattern is unanchored,         This verb is like (*PRUNE), except that if the pattern  is  unanchored,
5202         the "bumpalong" advance is not to the next character, but to the  posi-         the  "bumpalong" advance is not to the next character, but to the posi-
5203         tion  in  the  subject where (*SKIP) was encountered. (*SKIP) signifies         tion in the subject where (*SKIP) was  encountered.  (*SKIP)  signifies
5204         that whatever text was matched leading up to it cannot  be  part  of  a         that  whatever  text  was  matched leading up to it cannot be part of a
5205         successful match. Consider:         successful match. Consider:
5206    
5207           a+(*SKIP)b           a+(*SKIP)b
5208    
5209         If  the  subject  is  "aaaac...",  after  the first match attempt fails         If the subject is "aaaac...",  after  the  first  match  attempt  fails
5210         (starting at the first character in the  string),  the  starting  point         (starting  at  the  first  character in the string), the starting point
5211         skips on to start the next attempt at "c". Note that a possessive quan-         skips on to start the next attempt at "c". Note that a possessive quan-
5212         tifier does not have the same effect as this example; although it would         tifier  does not have the same effect as this example; although it would
5213         suppress  backtracking  during  the  first  match  attempt,  the second         suppress backtracking  during  the  first  match  attempt,  the  second
5214         attempt would start at the second character instead of skipping  on  to         attempt  would  start at the second character instead of skipping on to
5215         "c".         "c".
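
The difference in starting points described above can be made concrete with a simplified model. The Python sketch below does not use PCRE; it simulates a+(*SKIP)b and the possessive a++b by hand and records which start offsets are attempted. The function name start_points and its skip flag are hypothetical, and attempts at the very end of the subject are ignored for brevity.

```python
def start_points(subject: str, skip: bool) -> list:
    """Return the start offsets tried for a+(*SKIP)b (skip=True)
    or for the possessive a++b (skip=False)."""
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        pos = start
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        if pos > start and pos < len(subject) and subject[pos] == "b":
            return tried            # match found at this start offset
        if skip and pos > start:
            start = pos             # resume where (*SKIP) was encountered
        else:
            start += 1              # ordinary bumpalong by one character
    return tried


print(start_points("aaaac", skip=True))   # [0, 4]
print(start_points("aaaac", skip=False))  # [0, 1, 2, 3, 4]
```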
5216    
5217           (*THEN)           (*THEN)
5218    
5219         This verb causes a skip to the next alternation if the rest of the pat-         This verb causes a skip to the next alternation if the rest of the pat-
5220         tern does not match. That is, it cancels pending backtracking, but only         tern does not match. That is, it cancels pending backtracking, but only
5221         within  the  current  alternation.  Its name comes from the observation         within the current alternation. Its name  comes  from  the  observation
5222         that it can be used for a pattern-based if-then-else block:         that it can be used for a pattern-based if-then-else block:
5223    
5224           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5225    
5226         If the COND1 pattern matches, FOO is tried (and possibly further  items         If  the COND1 pattern matches, FOO is tried (and possibly further items
5227         after  the  end  of  the group if FOO succeeds); on failure the matcher         after the end of the group if FOO succeeds);  on  failure  the  matcher
5228         skips to the second alternative and tries COND2,  without  backtracking         skips  to  the second alternative and tries COND2, without backtracking
5229         into  COND1.  If  (*THEN)  is  used outside of any alternation, it acts         into COND1. If (*THEN) is used outside  of  any  alternation,  it  acts
5230         exactly like (*PRUNE).         exactly like (*PRUNE).
5231    
5232    
# Line 5247  AUTHOR Line 5244  AUTHOR
5244    
5245  REVISION  REVISION
5246    
5247         Last updated: 04 October 2009         Last updated: 18 October 2009
5248         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5249  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5250    
# Line 5754  PARTIAL MATCHING USING pcre_dfa_exec() Line 5751  PARTIAL MATCHING USING pcre_dfa_exec()
5751    
5752  PARTIAL MATCHING AND WORD BOUNDARIES  PARTIAL MATCHING AND WORD BOUNDARIES
5753    
5754         If  a  pattern ends with one of sequences \w or \W, which test for word         If  a  pattern ends with one of sequences \b or \B, which test for word
5755         boundaries, partial matching with PCRE_PARTIAL_SOFT can  give  counter-         boundaries, partial matching with PCRE_PARTIAL_SOFT can  give  counter-
5756         intuitive results. Consider this pattern:         intuitive results. Consider this pattern:
5757    
# Line 5861  MULTI-SEGMENT MATCHING WITH pcre_exec() Line 5858  MULTI-SEGMENT MATCHING WITH pcre_exec()
5858           data> The date is 23ja\P           data> The date is 23ja\P
5859           Partial match: 23ja           Partial match: 23ja
5860    
5861         The this stage, an application could discard the text preceding "23ja",         At  this stage, an application could discard the text preceding "23ja",
5862         add on text from the next segment, and call pcre_exec()  again.  Unlike         add on text from the next segment, and call pcre_exec()  again.  Unlike
5863         pcre_dfa_exec(),  the  entire matching string must always be available,         pcre_dfa_exec(),  the  entire matching string must always be available,
5864         and the complete matching process occurs for each call, so more  memory         and the complete matching process occurs for each call, so more  memory
# Line 5938  ISSUES WITH MULTI-SEGMENT MATCHING Line 5935  ISSUES WITH MULTI-SEGMENT MATCHING
5935    
5936         4. Patterns that contain alternatives at the top level which do not all         4. Patterns that contain alternatives at the top level which do not all
5937         start with the  same  pattern  item  may  not  work  as  expected  when         start with the  same  pattern  item  may  not  work  as  expected  when
5938         pcre_dfa_exec() is used. For example, consider this pattern:         PCRE_DFA_RESTART  is  used  with pcre_dfa_exec(). For example, consider
5939           this pattern:
5940    
5941           1234|3789           1234|3789
5942    
5943         If  the  first  part of the subject is "ABC123", a partial match of the         If the first part of the subject is "ABC123", a partial  match  of  the
5944         first alternative is found at offset 3. There is no partial  match  for         first  alternative  is found at offset 3. There is no partial match for
5945         the second alternative, because such a match does not start at the same         the second alternative, because such a match does not start at the same
5946         point in the subject string. Attempting to  continue  with  the  string         point  in  the  subject  string. Attempting to continue with the string
5947         "7890"  does  not  yield  a  match because only those alternatives that         "7890" does not yield a match  because  only  those  alternatives  that
5948         match at one point in the subject are remembered.  The  problem  arises         match  at  one  point in the subject are remembered. The problem arises
5949         because  the  start  of the second alternative matches within the first         because the start of the second alternative matches  within  the  first
5950         alternative. There is no problem with  anchored  patterns  or  patterns         alternative.  There  is  no  problem with anchored patterns or patterns
5951         such as:         such as:
5952    
5953           1234|ABCD           1234|ABCD
5954    
5955         where  no  string can be a partial match for both alternatives. This is         where no string can be a partial match for both alternatives.  This  is
5956         not a problem if pcre_exec() is used, because the entire match  has  to         not  a  problem if pcre_exec() is used, because the entire match has to
5957         be rerun each time:         be rerun each time:
5958    
5959             re> /1234|3789/             re> /1234|3789/
# Line 5964  ISSUES WITH MULTI-SEGMENT MATCHING Line 5962  ISSUES WITH MULTI-SEGMENT MATCHING
5962           data> 1237890           data> 1237890
5963            0: 3789            0: 3789
5964    
5965           Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
5966           running the entire match can also be used with pcre_dfa_exec(). Another
5967           possibility is to work with two buffers. If a partial match at offset n
5968           in  the first buffer is followed by "no match" when PCRE_DFA_RESTART is
5969           used on the second buffer, you can then try a  new  match  starting  at
5970           offset n+1 in the first buffer.
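
The two-buffer idea can be sketched as follows. This is a toy Python model, not the pcre_dfa_exec() API: it handles only the literal alternatives 1234|3789, treats a "partial match" as a proper prefix of an alternative reaching the end of the first buffer, and ignores complete matches inside the first buffer. All function names are hypothetical.

```python
ALTS = ["1234", "3789"]


def partial_at(buf: str, n: int) -> list:
    """Alternatives of which buf[n:] is a non-empty proper prefix."""
    tail = buf[n:]
    return [a for a in ALTS if tail and len(tail) < len(a) and a.startswith(tail)]


def first_partial(buf: str, from_offset: int = 0):
    """First offset >= from_offset with a partial match, plus live alternatives."""
    for n in range(from_offset, len(buf)):
        live = partial_at(buf, n)
        if live:
            return n, live
    return None


def restart(live: list, consumed: int, buf2: str) -> bool:
    """Model a restart: only the remembered alternatives are continued."""
    return any(buf2.startswith(a[consumed:]) for a in live)


def match_two_buffers(buf1: str, buf2: str):
    """Return the offset in buf1 where a match spanning both buffers starts."""
    offset = 0
    while True:
        found = first_partial(buf1, offset)
        if found is None:
            return None
        n, live = found
        if restart(live, len(buf1) - n, buf2):
            return n
        offset = n + 1      # "no match" after restart: retry from n+1


print(match_two_buffers("ABC123", "7890"))  # 5
```

Here the partial match of 1234 at offset 3 cannot be continued with "7890", so the search resumes at offset 4 and finds the partial match of 3789 at offset 5.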
5971    
5972    
5973  AUTHOR  AUTHOR
5974    
# Line 5974  AUTHOR Line 5979  AUTHOR
5979    
5980  REVISION  REVISION
5981    
5982         Last updated: 29 September 2009         Last updated: 19 October 2009
5983         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2009 University of Cambridge.
5984  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5985    

