--- code/trunk/doc/html/pcretest.html 2007/02/24 21:40:45 77 +++ code/trunk/doc/html/pcretest.html 2007/02/24 21:41:42 93 @@ -23,15 +23,16 @@
  • OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
  • RESTARTING AFTER A PARTIAL MATCH
  • CALLOUTS -
  • SAVING AND RELOADING COMPILED PATTERNS -
  • AUTHOR +
  • NON-PRINTING CHARACTERS +
  • SAVING AND RELOADING COMPILED PATTERNS +
  • SEE ALSO +
  • AUTHOR
    SYNOPSIS

    -pcretest [-C] [-d] [-dfa] [-i] [-m] [-o osize] [-p] [-t] [source] -[destination] -

    -

    +pcretest [options] [source] [destination] +
    +
    pcretest was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular expressions. This document describes the features of the test program; for @@ -44,6 +45,11 @@


    OPTIONS

    +-b +Behave as if each regex has the /B (show bytecode) modifier; the internal +form is output after compilation. +

    +

    -C Output the version number of the PCRE library, and all available information about the optional features that are included, and then exit. @@ -51,7 +57,8 @@

    -d Behave as if each regex has the /D (debug) modifier; the internal -form is output after compilation. +form and information about the compiled pattern is output after compilation; +-d is equivalent to -b -i.

    -dfa @@ -60,6 +67,10 @@ standard pcre_exec() function (more detail is given below).

    +-help +Output a brief summary of these options and then exit. +

    +

    -i Behave as if each regex has the /I modifier; information about the compiled pattern is given after compilation. @@ -73,9 +84,11 @@

    -o osize Set the number of elements in the output vector that is used when calling -pcre_exec() to be osize. The default value is 45, which is enough -for 14 capturing subexpressions. The vector size can be changed for individual -matching calls by including \O in the data line (see below). +pcre_exec() or pcre_dfa_exec() to be osize. The default value +is 45, which is enough for 14 capturing subexpressions for pcre_exec() or +22 different matches for pcre_dfa_exec(). The vector size can be +changed for individual matching calls by including \O in the data line (see +below).
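The two figures quoted for the default vector size can be checked with a little arithmetic. The sketch below is not part of pcretest. The function names are made up for illustration, and it assumes the layout described in the pcreapi documentation: pcre_exec() returns offset pairs only in the first two-thirds of the vector (pair 0 being the whole match), whereas pcre_dfa_exec() uses one pair per match.

```python
def max_captures_pcre_exec(osize):
    """Capturing subexpressions that fit in a pcre_exec() output vector.

    Assumption: only the first two-thirds of the vector hold offset pairs,
    which leaves osize // 3 pairs, and pair 0 is the whole match.
    """
    pairs = osize // 3
    return max(pairs - 1, 0)


def max_matches_pcre_dfa_exec(osize):
    """Matches that fit in a pcre_dfa_exec() output vector (one pair each)."""
    return osize // 2


print(max_captures_pcre_exec(45))     # 14
print(max_matches_pcre_dfa_exec(45))  # 22
```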

    -p @@ -84,11 +97,28 @@ set.

    +-q +Do not output the version number of pcretest at the start of execution. +

    +

    +-S size +On Unix-like systems, set the size of the runtime stack to size +megabytes. +

    +

    -t Run each compile, study, and match many times with a timer, and output resulting time per compile or match (in milliseconds). Do not set -m with -t, because you will then get the size output a zillion times, and the -timing will be distorted. +timing will be distorted. You can control the number of iterations that are +used for timing by following -t with a number (as a separate item on the +command line). For example, "-t 1000" would iterate 1000 times. The default is +to iterate 500000 times. +

    +

    +-tm +This is like -t except that it times only the matching phase, not the +compile or study phases.


    DESCRIPTION

    @@ -105,14 +135,15 @@

    Each data line is matched separately and independently. If you want to do -multiple-line matches, you have to use the \n escape sequence in a single line -of input to encode the newline characters. The maximum length of data line is -30,000 characters. +multi-line matches, you have to use the \n escape sequence (or \r or \r\n, +etc., depending on the newline setting) in a single line of input to encode the +newline sequences. There is no limit on the length of data lines; the input +buffer is automatically extended if it is too small.

    An empty line signals the end of the data lines, at which point a new regular expression is read. The regular expressions are given enclosed in any -non-alphanumeric delimiters other than backslash, for example +non-alphanumeric delimiters other than backslash, for example:

       /(a|bc)x+yz/
     
    @@ -159,14 +190,32 @@ The following table shows additional modifiers for setting PCRE options that do not correspond to anything in Perl:
    -  /A    PCRE_ANCHORED
    -  /C    PCRE_AUTO_CALLOUT
    -  /E    PCRE_DOLLAR_ENDONLY
    -  /f    PCRE_FIRSTLINE
    -  /N    PCRE_NO_AUTO_CAPTURE
    -  /U    PCRE_UNGREEDY
    -  /X    PCRE_EXTRA
    +  /A       PCRE_ANCHORED
    +  /C       PCRE_AUTO_CALLOUT
    +  /E       PCRE_DOLLAR_ENDONLY
    +  /f       PCRE_FIRSTLINE
    +  /J       PCRE_DUPNAMES
    +  /N       PCRE_NO_AUTO_CAPTURE
    +  /U       PCRE_UNGREEDY
    +  /X       PCRE_EXTRA
    +  /<cr>    PCRE_NEWLINE_CR
    +  /<lf>    PCRE_NEWLINE_LF
    +  /<crlf>  PCRE_NEWLINE_CRLF
    +  /<any>   PCRE_NEWLINE_ANY
     
    +Those specifying line ending sequences are literal strings as shown. This +example sets multiline matching with CRLF as the line ending sequence: +
    +  /^abc/m<crlf>
    +
    +Details of the meanings of these PCRE options are given in the +pcreapi +documentation. +

    +
    +Finding all matches in a string +
    +

    Searching for all possible matches within each subject string can be requested by the /g or /G modifier. After finding a match, PCRE is called again to search the remainder of the subject string. The difference between @@ -184,6 +233,9 @@ match is retried. This imitates the way Perl handles such cases when using the /g modifier or the split() function.

    +
    +Other modifiers +

    There are yet more modifiers for controlling the way pcretest operates. @@ -195,6 +247,10 @@ multiple copies of the same substring.

    +The /B modifier is a debugging feature. It requests that pcretest +output a representation of the compiled byte code after compilation. +

    +

    The /L modifier must be followed directly by the name of a locale, for example,

    @@ -213,10 +269,8 @@
     pattern. If the pattern is studied, the results of that are also output.
     

    -The /D modifier is a PCRE debugging feature, which also assumes /I. -It causes the internal form of compiled regular expressions to be output after -compilation. If the pattern was studied, the information returned is also -output. +The /D modifier is a PCRE debugging feature, and is equivalent to +/BI, that is, both the /B and the /I modifiers.

    The /F modifier causes pcretest to flip the byte order of the @@ -264,19 +318,20 @@ expressions, you probably don't need any of these. The following escapes are recognized:

    -  \a         alarm (= BEL)
    -  \b         backspace
    -  \e         escape
    -  \f         formfeed
    -  \n         newline
    -  \r         carriage return
    -  \t         tab
    -  \v         vertical tab
    +  \a         alarm (BEL, \x07)
    +  \b         backspace (\x08)
    +  \e         escape (\x1b)
    +  \f         formfeed (\x0c)
    +  \n         newline (\x0a)
    +  \qdd       set the PCRE_MATCH_LIMIT limit to dd (any number of digits)
    +  \r         carriage return (\x0d)
    +  \t         tab (\x09)
    +  \v         vertical tab (\x0b)
       \nnn       octal character (up to 3 octal digits)
       \xhh       hexadecimal character (up to 2 hex digits)
       \x{hh...}  hexadecimal character, any number of digits in UTF-8 mode
    -  \A         pass the PCRE_ANCHORED option to pcre_exec()
    -  \B         pass the PCRE_NOTBOL option to pcre_exec()
    +  \A         pass the PCRE_ANCHORED option to pcre_exec() or pcre_dfa_exec()
    +  \B         pass the PCRE_NOTBOL option to pcre_exec() or pcre_dfa_exec()
       \Cdd       call pcre_copy_substring() for substring dd after a successful match (number less than 32)
       \Cname     call pcre_copy_named_substring() for substring "name" after a successful match (name
                    terminated by next non-alphanumeric character)
    @@ -291,30 +346,43 @@
       \Gname     call pcre_get_named_substring() for substring "name" after a successful match (name
                    terminated by next non-alphanumeric character)
       \L         call pcre_get_substringlist() after a successful match
    -  \M         discover the minimum MATCH_LIMIT setting
    -  \N         pass the PCRE_NOTEMPTY option to pcre_exec()
    +  \M         discover the minimum MATCH_LIMIT and MATCH_LIMIT_RECURSION settings
    +  \N         pass the PCRE_NOTEMPTY option to pcre_exec() or pcre_dfa_exec()
       \Odd       set the size of the output vector passed to pcre_exec() to dd (any number of digits)
       \P         pass the PCRE_PARTIAL option to pcre_exec() or pcre_dfa_exec()
    +  \Qdd       set the PCRE_MATCH_LIMIT_RECURSION limit to dd (any number of digits)
       \R         pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
       \S         output details of memory get/free calls during matching
    -  \Z         pass the PCRE_NOTEOL option to pcre_exec()
    -  \?         pass the PCRE_NO_UTF8_CHECK option to pcre_exec()
    +  \Z         pass the PCRE_NOTEOL option to pcre_exec() or pcre_dfa_exec()
    +  \?         pass the PCRE_NO_UTF8_CHECK option to pcre_exec() or pcre_dfa_exec()
       \>dd       start the match at offset dd (any number of digits);
    -               this sets the startoffset argument for pcre_exec()
    -
    -A backslash followed by anything else just escapes the anything else. If the -very last character is a backslash, it is ignored. This gives a way of passing -an empty line as data, since a real empty line terminates the data input. + this sets the startoffset argument for pcre_exec() or pcre_dfa_exec() + \<cr> pass the PCRE_NEWLINE_CR option to pcre_exec() or pcre_dfa_exec() + \<lf> pass the PCRE_NEWLINE_LF option to pcre_exec() or pcre_dfa_exec() + \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre_exec() or pcre_dfa_exec() + \<any> pass the PCRE_NEWLINE_ANY option to pcre_exec() or pcre_dfa_exec() +
    +The escapes that specify line ending sequences are literal strings, exactly as +shown. No more than one newline setting should be present in any data line. +

    +

    +A backslash followed by anything else just escapes the anything else. If +the very last character is a backslash, it is ignored. This gives a way of +passing an empty line as data, since a real empty line terminates the data +input.

    If \M is present, pcretest calls pcre_exec() several times, with -different values in the match_limit field of the pcre_extra data -structure, until it finds the minimum number that is needed for -pcre_exec() to complete. This number is a measure of the amount of -recursion and backtracking that takes place, and checking it out can be -instructive. For most simple matches, the number is quite small, but for -patterns with very large numbers of matching possibilities, it can become large -very quickly with increasing length of subject string. +different values in the match_limit and match_limit_recursion +fields of the pcre_extra data structure, until it finds the minimum +numbers for each parameter that allow pcre_exec() to complete. The +match_limit number is a measure of the amount of backtracking that takes +place, and checking it out can be instructive. For most simple matches, the +number is quite small, but for patterns with very large numbers of matching +possibilities, it can become large very quickly with increasing length of +subject string. The match_limit_recursion number is a measure of how much +stack (or, if PCRE is compiled with NO_RECURSE, how much heap) memory is needed +to complete the match attempt.
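The search for a minimum limit can be pictured as a doubling phase followed by bisection. This is only a sketch of the idea, not the code pcretest uses. The callback completes(limit) is a hypothetical stand-in for one pcre_exec() call made with the given limit, and the sketch assumes that once a limit is large enough, every larger limit also succeeds.

```python
def find_min_limit(completes, start=64):
    """Return the smallest limit for which completes(limit) is true.

    completes is a hypothetical callback standing in for a single
    pcre_exec() call; start is an arbitrary first guess.
    """
    low = 0        # treated as a failing limit
    high = start
    # Doubling phase: grow the limit until the match attempt completes.
    while not completes(high):
        low = high
        high *= 2
    # Bisection phase: low always fails, high always completes.
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high


# Usage with a fake matcher that needs a limit of at least 1000.
print(find_min_limit(lambda limit: limit >= 1000))  # 1000
```

The same routine could be run once for match_limit and once for match_limit_recursion, since the text says a minimum is found for each parameter.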

    When \O is used, the value specified may be higher or lower than the size set @@ -323,8 +391,9 @@

    If the /P modifier was present on the pattern, causing the POSIX wrapper -API to be used, only \B and \Z have any effect, causing REG_NOTBOL and -REG_NOTEOL to be passed to regexec() respectively. +API to be used, the only option-setting sequences that have any effect are \B +and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively, to be passed to +regexec().

    The use of \x{hh...} to represent UTF-8 characters is not dependent on the use @@ -363,7 +432,7 @@ of an interactive pcretest run.

       $ pcretest
    -  PCRE version 5.00 07-Sep-2004
    +  PCRE version 7.0 30-Nov-2006
     
         re> /^abc(\d+)/
       data> abc123
    @@ -374,9 +443,9 @@
     
    If the strings contain any non-printing characters, they are output as \0x escapes, or as \x{...} escapes if the /8 modifier was present on the -pattern. If the pattern has the /+ modifier, the output for substring 0 -is followed by the the rest of the subject string, identified by "0+" like -this: +pattern. See below for the definition of non-printing characters. If the +pattern has the /+ modifier, the output for substring 0 is followed by +the rest of the subject string, identified by "0+" like this:
         re> /cat/+
       data> cataract
    @@ -406,9 +475,10 @@
     parentheses after each string for \C and \G.
     

    -Note that while patterns can be continued over several lines (a plain ">" +Note that whereas patterns can be continued over several lines (a plain ">" prompt is used for continuations), data lines may not. However newlines can be -included in data by means of the \n escape. +included in data by means of the \n escape (or \r, \r\n, etc., depending on +the newline sequence setting).


    OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION

    @@ -427,7 +497,7 @@ longest matching string is always given first (and numbered zero).

    -If \fB/g\P is present on the pattern, the search for further matches resumes +If /g is present on the pattern, the search for further matches resumes at the end of the longest match. For example:

         re> /(tang|tangerine|tan)/g
    @@ -501,7 +571,19 @@
     pcrecallout
     documentation.
     

    -
    SAVING AND RELOADING COMPILED PATTERNS
    +
    NON-PRINTING CHARACTERS
    +

    +When pcretest is outputting text in the compiled version of a pattern, +bytes other than 32-126 are always treated as non-printing characters and are +therefore shown as hex escapes. +

    +

    +When pcretest is outputting text that is a matched part of a subject +string, it behaves in the same way, unless a different locale has been set for +the pattern (using the /L modifier). In this case, the isprint() +function is used to distinguish printing and non-printing characters. +
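The default rule (no locale set) is easy to state in code. The following is a rough sketch, not pcretest's own output routine. The function name is hypothetical, and the exact escape spelling (a backslash, "x", and two lowercase hex digits) is an assumption made for the example.

```python
def show_bytes(data):
    """Render bytes the way the default rule describes.

    Bytes 32-126 are shown as themselves; all others become hex escapes.
    The escape spelling used here is an assumption.
    """
    out = []
    for b in data:
        if 32 <= b <= 126:
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


print(show_bytes(b"abc\x07\xff"))  # abc\x07\xff
```

With a locale set by /L, the range test would be replaced by a call to the C library's isprint(), as the paragraph above describes.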

    +
    SAVING AND RELOADING COMPILED PATTERNS

    The facilities described in this section are not available when the POSIX interface to PCRE is being used, that is, when the /P pattern modifier is @@ -563,18 +645,23 @@ Finally, if you attempt to load a file that is not in the correct format, the result is undefined.

    -
    AUTHOR
    +
    SEE ALSO
    +

    +pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), +pcrepartial(3), pcrepattern(3), pcreprecompile(3). +

    +
    AUTHOR

    Philip Hazel
    University Computing Service,
    -Cambridge CB2 3QG, England. +Cambridge CB2 3QH, England.

    -Last updated: 28 February 2005 +Last updated: 30 November 2006
    -Copyright © 1997-2005 University of Cambridge. +Copyright © 1997-2006 University of Cambridge.

    Return to the PCRE index page.