Diff of /code/trunk/doc/pcre.html

revision 51 by nigel, Sat Feb 24 21:39:37 2007 UTC revision 53 by nigel, Sat Feb 24 21:39:42 2007 UTC
# Line 38  conversion went wrong. Line 38  conversion went wrong.
38  <LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>  <LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>
39  <LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>  <LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>
40  <LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A>  <LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A>
41  <LI><A NAME="TOC31" HREF="#SEC31">AUTHOR</A>  <LI><A NAME="TOC31" HREF="#SEC31">SAMPLE PROGRAM</A>
42    <LI><A NAME="TOC32" HREF="#SEC32">AUTHOR</A>
43  </UL>  </UL>
44  <LI><A NAME="SEC1" HREF="#TOC1">NAME</A>  <LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
45  <P>  <P>
# Line 126  use these to include support for differe Line 127  use these to include support for differe
127  </P>  </P>
128  <P>  <P>
129  The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>  The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>
130  are used for compiling and matching regular expressions.  are used for compiling and matching regular expressions. A sample program that
131    demonstrates the simplest way of using them is given in the file
132    <I>pcredemo.c</I>. The last section of this man page describes how to run it.
133  </P>  </P>
134  <P>  <P>
135  The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and  The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
# Line 168  the same compiled pattern can safely be Line 171  the same compiled pattern can safely be
171  The function <B>pcre_compile()</B> is called to compile a pattern into an  The function <B>pcre_compile()</B> is called to compile a pattern into an
172  internal form. The pattern is a C string terminated by a binary zero, and  internal form. The pattern is a C string terminated by a binary zero, and
173  is passed in the argument <I>pattern</I>. A pointer to a single block of memory  is passed in the argument <I>pattern</I>. A pointer to a single block of memory
174  that is obtained via <B>pcre_malloc</B> is returned. This contains the  that is obtained via <B>pcre_malloc</B> is returned. This contains the compiled
175  compiled code and related data. The <B>pcre</B> type is defined for this for  code and related data. The <B>pcre</B> type is defined for the returned block;
176  convenience, but in fact <B>pcre</B> is just a typedef for <B>void</B>, since the  this is a typedef for a structure whose contents are not externally defined. It
177  contents of the block are not externally defined. It is up to the caller to  is up to the caller to free the memory when it is no longer required.
178  free the memory when it is no longer required.  </P>
179    <P>
180    Although the compiled code of a PCRE regex is relocatable, that is, it does not
181    depend on memory location, the complete <B>pcre</B> data block is not
182    fully relocatable, because it contains a copy of the <I>tableptr</I> argument,
183    which is an address (see below).
184  </P>  </P>
185  <P>  <P>
186  The size of a compiled pattern is roughly proportional to the length of the  The size of a compiled pattern is roughly proportional to the length of the
# Line 206  locale. Otherwise, <I>tableptr</I> must Line 214  locale. Otherwise, <I>tableptr</I> must
214  <B>pcre_maketables()</B>. See the section on locale support below.  <B>pcre_maketables()</B>. See the section on locale support below.
215  </P>  </P>
216  <P>  <P>
217    This code fragment shows a typical straightforward call to <B>pcre_compile()</B>:
218    </P>
219    <P>
220    <PRE>
221      pcre *re;
222      const char *error;
223      int erroffset;
224      re = pcre_compile(
225        "^A.*Z",          /* the pattern */
226        0,                /* default options */
227        &error,           /* for error message */
228        &erroffset,       /* for error offset */
229        NULL);            /* use default character tables */
230    </PRE>
231    </P>
232    <P>
233  The following option bits are defined in the header file:  The following option bits are defined in the header file:
234  </P>  </P>
235  <P>  <P>
# Line 329  Details of exactly what it entails are g Line 353  Details of exactly what it entails are g
353  When a pattern is going to be used several times, it is worth spending more  When a pattern is going to be used several times, it is worth spending more
354  time analyzing it in order to speed up the time taken for matching. The  time analyzing it in order to speed up the time taken for matching. The
355  function <B>pcre_study()</B> takes a pointer to a compiled pattern as its first  function <B>pcre_study()</B> takes a pointer to a compiled pattern as its first
356  argument, and returns a pointer to a <B>pcre_extra</B> block (another <B>void</B>  argument, and returns a pointer to a <B>pcre_extra</B> block (another typedef
357  typedef) containing additional information about the pattern; this can be  for a structure with hidden contents) containing additional information about
358  passed to <B>pcre_exec()</B>. If no additional information is available, NULL  the pattern; this can be passed to <B>pcre_exec()</B>. If no additional
359  is returned.  information is available, NULL is returned.
360  </P>  </P>
361  <P>  <P>
362  The second argument contains option bits. At present, no options are defined  The second argument contains option bits. At present, no options are defined
# Line 344  studying succeeds (even if no data is re Line 368  studying succeeds (even if no data is re
368  set to NULL. Otherwise it points to a textual error message.  set to NULL. Otherwise it points to a textual error message.
369  </P>  </P>
370  <P>  <P>
371    This is a typical call to <B>pcre_study</B>():
372    </P>
373    <P>
374    <PRE>
375      pcre_extra *pe;
376      pe = pcre_study(
377        re,             /* result of pcre_compile() */
378        0,              /* no options exist */
379        &error);        /* set to NULL or points to a message */
380    </PRE>
381    </P>
382    <P>
383  At present, studying a pattern is useful only for non-anchored patterns that do  At present, studying a pattern is useful only for non-anchored patterns that do
384  not have a single fixed starting character. A bitmap of possible starting  not have a single fixed starting character. A bitmap of possible starting
385  characters is created.  characters is created.
# Line 403  the following negative numbers: Line 439  the following negative numbers:
439  </PRE>  </PRE>
440  </P>  </P>
441  <P>  <P>
442    Here is a typical call of <B>pcre_fullinfo()</B>, to obtain the length of the
443    compiled pattern:
444    </P>
445    <P>
446    <PRE>
447      int rc;
448      unsigned long int length;
449      rc = pcre_fullinfo(
450        re,               /* result of pcre_compile() */
451        pe,               /* result of pcre_study(), or NULL */
452        PCRE_INFO_SIZE,   /* what is required */
453        &length);         /* where to put the data */
454    </PRE>
455    </P>
456    <P>
457  The possible values for the third argument are defined in <B>pcre.h</B>, and are  The possible values for the third argument are defined in <B>pcre.h</B>, and are
458  as follows:  as follows:
459  </P>  </P>
# Line 413  as follows: Line 464  as follows:
464  </P>  </P>
465  <P>  <P>
466  Return a copy of the options with which the pattern was compiled. The fourth  Return a copy of the options with which the pattern was compiled. The fourth
467  argument should point to au <B>unsigned long int</B> variable. These option bits  argument should point to an <B>unsigned long int</B> variable. These option bits
468  are those specified in the call to <B>pcre_compile()</B>, modified by any  are those specified in the call to <B>pcre_compile()</B>, modified by any
469  top-level option settings within the pattern itself, and with the PCRE_ANCHORED  top-level option settings within the pattern itself, and with the PCRE_ANCHORED
470  bit forcibly set if the form of the pattern implies that it can match only at  bit forcibly set if the form of the pattern implies that it can match only at
# Line 528  pattern has been studied, the result of Line 579  pattern has been studied, the result of
579  <I>extra</I> argument. Otherwise this must be NULL.  <I>extra</I> argument. Otherwise this must be NULL.
580  </P>  </P>
581  <P>  <P>
582    Here is an example of a simple call to <B>pcre_exec()</B>:
583    </P>
584    <P>
585    <PRE>
586      int rc;
587      int ovector[30];
588      rc = pcre_exec(
589        re,             /* result of pcre_compile() */
590        NULL,           /* we didn't study the pattern */
591        "some string",  /* the subject string */
592        11,             /* the length of the subject string */
593        0,              /* start at offset 0 in the subject */
594        0,              /* default options */
595        ovector,        /* vector for substring information */
596        30);            /* number of elements in the vector */
597    </PRE>
598    </P>
599    <P>
600  The PCRE_ANCHORED option can be passed in the <I>options</I> argument, whose  The PCRE_ANCHORED option can be passed in the <I>options</I> argument, whose
601  unused bits must be zero. However, if a pattern was compiled with  unused bits must be zero. However, if a pattern was compiled with
602  PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it  PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
# Line 588  below) and trying an ordinary match agai Line 657  below) and trying an ordinary match agai
657  <P>  <P>
658  The subject string is passed as a pointer in <I>subject</I>, a length in  The subject string is passed as a pointer in <I>subject</I>, a length in
659  <I>length</I>, and a starting offset in <I>startoffset</I>. Unlike the pattern  <I>length</I>, and a starting offset in <I>startoffset</I>. Unlike the pattern
660  string, it may contain binary zero characters. When the starting offset is  string, the subject may contain binary zero characters. When the starting
661  zero, the search for a match starts at the beginning of the subject, and this  offset is zero, the search for a match starts at the beginning of the subject,
662  is by far the most common case.  and this is by far the most common case.
663  </P>  </P>
664  <P>  <P>
665  A non-zero starting offset is useful when searching for another match in the  A non-zero starting offset is useful when searching for another match in the
# Line 833  There are some size limitations in PCRE Line 902  There are some size limitations in PCRE
902  practice be relevant.  practice be relevant.
903  The maximum length of a compiled pattern is 65539 (sic) bytes.  The maximum length of a compiled pattern is 65539 (sic) bytes.
904  All values in repeating quantifiers must be less than 65536.  All values in repeating quantifiers must be less than 65536.
905  The maximum number of capturing subpatterns is 99.  The maximum number of capturing subpatterns is 65535.
906  The maximum number of all parenthesized subpatterns, including capturing  There is no limit to the number of non-capturing subpatterns, but the maximum
907    depth of nesting of all kinds of parenthesized subpattern, including capturing
908  subpatterns, assertions, and other types of subpattern, is 200.  subpatterns, assertions, and other types of subpattern, is 200.
909  </P>  </P>
910  <P>  <P>
# Line 1225  PCRE_MULTILINE is set. Line 1295  PCRE_MULTILINE is set.
1295  <P>  <P>
1296  Note that the sequences \A, \Z, and \z can be used to match the start and  Note that the sequences \A, \Z, and \z can be used to match the start and
1297  end of the subject in both modes, and if all branches of a pattern start with  end of the subject in both modes, and if all branches of a pattern start with
1298  \A is it always anchored, whether PCRE_MULTILINE is set or not.  \A it is always anchored, whether PCRE_MULTILINE is set or not.
1299  </P>  </P>
1300  <LI><A NAME="SEC16" HREF="#TOC1">FULL STOP (PERIOD, DOT)</A>  <LI><A NAME="SEC16" HREF="#TOC1">FULL STOP (PERIOD, DOT)</A>
1301  <P>  <P>
# Line 1350  negation, which is indicated by a ^ char Line 1420  negation, which is indicated by a ^ char
1420  </PRE>  </PRE>
1421  </P>  </P>
1422  <P>  <P>
1423  matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX  matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1424  syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not  syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1425  supported, and an error is given if they are encountered.  supported, and an error is given if they are encountered.
1426  </P>  </P>
# Line 1482  For example, if the string "the red king Line 1552  For example, if the string "the red king
1552  </P>  </P>
1553  <P>  <P>
1554  the captured substrings are "red king", "red", and "king", and are numbered 1,  the captured substrings are "red king", "red", and "king", and are numbered 1,
1555  2, and 3.  2, and 3, respectively.
1556  </P>  </P>
1557  <P>  <P>
1558  The fact that plain parentheses fulfil two functions is not always helpful.  The fact that plain parentheses fulfil two functions is not always helpful.
# Line 2375  The following UTF-8 features of Perl 5.6 Line 2445  The following UTF-8 features of Perl 5.6
2445  <P>  <P>
2446  2. The use of Unicode tables and properties and escapes \p, \P, and \X.  2. The use of Unicode tables and properties and escapes \p, \P, and \X.
2447  </P>  </P>
2448  <LI><A NAME="SEC31" HREF="#TOC1">AUTHOR</A>  <LI><A NAME="SEC31" HREF="#TOC1">SAMPLE PROGRAM</A>
2449    <P>
2450    The code below is a simple, complete demonstration program, to get you started
2451    with using PCRE. This code is also supplied in the file <I>pcredemo.c</I> in the
2452    PCRE distribution.
2453    </P>
2454    <P>
2455    The program compiles the regular expression that is its first argument, and
2456    matches it against the subject string in its second argument. No options are
2457    set, and default character tables are used. If matching succeeds, the program
2458    outputs the portion of the subject that matched, together with the contents of
2459    any captured substrings.
2460    </P>
2461    <P>
2462    On a Unix system that has PCRE installed in <I>/usr/local</I>, you can compile
2463    the demonstration program using a command like this:
2464    </P>
2465    <P>
2466    <PRE>
2467      gcc -o pcredemo pcredemo.c -I/usr/local/include -L/usr/local/lib -lpcre
2468    </PRE>
2469    </P>
2470    <P>
2471    Then you can run simple tests like this:
2472    </P>
2473    <P>
2474    <PRE>
2475      ./pcredemo 'cat|dog' 'the cat sat on the mat'
2476    </PRE>
2477    </P>
2478    <P>
2479    Note that there is a much more comprehensive test program, called
2480    <B>pcretest</B>, which supports many more facilities for testing regular
2481    expressions. The <B>pcredemo</B> program is provided as a simple coding example.
2482    </P>
2483    <P>
2484    On some operating systems (e.g. Solaris) you may get an error like this when
2485    you try to run <B>pcredemo</B>:
2486    </P>
2487    <P>
2488    <PRE>
2489      ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
2490    </PRE>
2491    </P>
2492    <P>
2493    This is caused by the way shared library support works on those systems. You
2494    need to add
2495    </P>
2496    <P>
2497    <PRE>
2498      -R/usr/local/lib
2499    </PRE>
2500    </P>
2501    <P>
2502    to the compile command to get round this problem. Here's the code:
2503    </P>
2504    <P>
2505    <PRE>
2506      #include &#60;stdio.h&#62;
2507      #include &#60;string.h&#62;
2508      #include &#60;pcre.h&#62;
2509    </PRE>
2510    </P>
2511    <P>
2512    <PRE>
2513      #define OVECCOUNT 30    /* should be a multiple of 3 */
2514    </PRE>
2515    </P>
2516    <P>
2517    <PRE>
2518      int main(int argc, char **argv)
2519      {
2520      pcre *re;
2521      const char *error;
2522      int erroffset;
2523      int ovector[OVECCOUNT];
2524      int rc, i;
2525    </PRE>
2526    </P>
2527    <P>
2528    <PRE>
2529      if (argc != 3)
2530        {
2531        printf("Two arguments required: a regex and a "
2532          "subject string\n");
2533        return 1;
2534        }
2535    </PRE>
2536    </P>
2537    <P>
2538    <PRE>
2539      /* Compile the regular expression in the first argument */
2540    </PRE>
2541    </P>
2542    <P>
2543    <PRE>
2544      re = pcre_compile(
2545        argv[1],     /* the pattern */
2546        0,           /* default options */
2547        &error,      /* for error message */
2548        &erroffset,  /* for error offset */
2549        NULL);       /* use default character tables */
2550    </PRE>
2551    </P>
2552    <P>
2553    <PRE>
2554      /* Compilation failed: print the error message and exit */
2555    </PRE>
2556    </P>
2557    <P>
2558    <PRE>
2559      if (re == NULL)
2560        {
2561        printf("PCRE compilation failed at offset %d: %s\n",
2562          erroffset, error);
2563        return 1;
2564        }
2565    </PRE>
2566    </P>
2567    <P>
2568    <PRE>
2569      /* Compilation succeeded: match the subject in the second
2570         argument */
2571    </PRE>
2572    </P>
2573    <P>
2574    <PRE>
2575      rc = pcre_exec(
2576        re,          /* the compiled pattern */
2577        NULL,        /* we didn't study the pattern */
2578        argv[2],     /* the subject string */
2579        (int)strlen(argv[2]), /* the length of the subject */
2580        0,           /* start at offset 0 in the subject */
2581        0,           /* default options */
2582        ovector,     /* vector for substring information */
2583        OVECCOUNT);  /* number of elements in the vector */
2584    </PRE>
2585    </P>
2586    <P>
2587    <PRE>
2588      /* Matching failed: handle error cases */
2589    </PRE>
2590    </P>
2591    <P>
2592    <PRE>
2593      if (rc &#60; 0)
2594        {
2595        switch(rc)
2596          {
2597          case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
2598          /*
2599          Handle other special cases if you like
2600          */
2601          default: printf("Matching error %d\n", rc); break;
2602          }
2603        return 1;
2604        }
2605    </PRE>
2606    </P>
2607    <P>
2608    <PRE>
2609      /* Match succeeded */
2610    </PRE>
2611    </P>
2612    <P>
2613    <PRE>
2614      printf("Match succeeded\n");
2615    </PRE>
2616    </P>
2617    <P>
2618    <PRE>
2619      /* The output vector wasn't big enough */
2620    </PRE>
2621    </P>
2622    <P>
2623    <PRE>
2624      if (rc == 0)
2625        {
2626        rc = OVECCOUNT/3;
2627        printf("ovector only has room for %d captured "
2628          "substrings\n", rc - 1);
2629        }
2630    </PRE>
2631    </P>
2632    <P>
2633    <PRE>
2634      /* Show substrings stored in the output vector */
2635    </PRE>
2636    </P>
2637    <P>
2638    <PRE>
2639      for (i = 0; i &#60; rc; i++)
2640        {
2641        char *substring_start = argv[2] + ovector[2*i];
2642        int substring_length = ovector[2*i+1] - ovector[2*i];
2643        printf("%2d: %.*s\n", i, substring_length,
2644          substring_start);
2645        }
2646    </PRE>
2647    </P>
2648    <P>
2649    <PRE>
2650      return 0;
2651      }
2652    </PRE>
2653    </P>
2654    <LI><A NAME="SEC32" HREF="#TOC1">AUTHOR</A>
2655  <P>  <P>
2656  Philip Hazel &#60;ph10@cam.ac.uk&#62;  Philip Hazel &#60;ph10@cam.ac.uk&#62;
2657  <BR>  <BR>
# Line 2388  Cambridge CB2 3QG, England. Line 2664  Cambridge CB2 3QG, England.
2664  Phone: +44 1223 334714  Phone: +44 1223 334714
2665  </P>  </P>
2666  <P>  <P>
2667  Last updated: 28 August 2000,  Last updated: 15 August 2001
 <BR>  
 <PRE>  
   the 250th anniversary of the death of J.S. Bach.  
2668  <BR>  <BR>
2669  </PRE>  Copyright (c) 1997-2001 University of Cambridge.
 Copyright (c) 1997-2000 University of Cambridge.  
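
Note on the sample program added in v.53: its final loop reads each captured substring from ovector as a pair of offsets, with element 2*i as the start and element 2*i+1 as the offset just past the end. The sketch below isolates that indexing so it can be tried without linking against PCRE. The helper name copy_capture() is hypothetical and is not part of the PCRE API, and treating negative offsets as an unset group is an assumption of this sketch rather than something stated in the diff.

```c
#include <assert.h>
#include <string.h>

/* Copy capture number i out of subject into buf, using the offset-pair
   layout that the sample program reads from ovector: element 2*i is the
   start offset and element 2*i+1 is the offset just past the end.
   Returns the substring length, or -1 if the offsets are negative
   (assumed here to mean "unset") or if buf is too small.
   copy_capture() is a hypothetical helper, not a PCRE function. */
static int copy_capture(const char *subject, const int *ovector, int i,
                        char *buf, int bufsize)
{
  int start, end, length;

  assert(subject != NULL && ovector != NULL && buf != NULL);

  start = ovector[2*i];
  end = ovector[2*i + 1];
  length = end - start;

  /* Reject unset groups and substrings that would not fit with the
     terminating zero. */
  if (start < 0 || length < 0 || length >= bufsize) return -1;

  memcpy(buf, subject + start, (size_t)length);
  buf[length] = '\0';
  return length;
}
```

For the subject "the red king" with a hand-written vector {0, 12, 4, 12, 4, 7, 8, 12} standing in for what pcre_exec() would fill in, captures 1, 2, and 3 come out as "red king", "red", and "king", which agrees with the numbering described in the hunk at line 1554.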
