Changeset 13851 in project


Timestamp:
03/20/09 21:47:37 (11 years ago)
Author:
felix winkelmann
Message:

documented SREs, some script and manual fixes

Location:
chicken/trunk
Files:
1 added
3 edited

  • chicken/trunk/manual/The User's Manual

    r13710 r13851  
    11[[tags:manual]]
    22
    3 [[image:http://www.call-with-current-continuation.org/chicken.png]]
     3[[image:http://www.call-with-current-continuation.org/chicken4.png]]
    44
    55== The CHICKEN User's Manual
  • chicken/trunk/manual/Unit regex

    r13740 r13851  
    1111This library unit exposes two APIs: the one listed below and the
    1212original irregex API. To use the latter, import from the {{irregex}} module.
     13
     14Regular expressions may be either POSIX-style strings (with most PCRE
     15extensions) or an SCSH-style SRE. There is no {{(rx ...)}} syntax -
     16just use normal Scheme lists, with quasiquote if you like.
     17
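Because an SRE is ordinary list data, it can be assembled at run time with quasiquote. A minimal sketch, assuming CHICKEN 4's {{(use regex)}} loading form; {{tag-sre}} is a hypothetical helper introduced only for illustration:

```scheme
(use regex)

;; Hypothetical helper: builds an SRE matching a tag with the given name.
(define (tag-sre name)
  `(: "<" ,name ">"))

;; The quasiquoted list is passed wherever a regular expression is expected.
(define hit  (string-search (tag-sre "html") "text <html> text"))
(define miss (string-search (tag-sre "body") "text <html> text"))
```

Here {{hit}} should hold a match and {{miss}} should be {{#f}}, since only the first tag name occurs in the input.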
    1318
    1419
     
    192197</enscript>
    193198
     199=== Extended SRE Syntax
     200
     201The following table summarizes the SRE syntax, with detailed explanations following.
     202
     203  ;; basic patterns
     204  <string>                          ; literal string
     205  (seq <sre> ...)                   ; sequence
     206  (: <sre> ...)
     207  (or <sre> ...)                    ; alternation
     208 
     209  ;; optional/multiple patterns
     210  (? <sre> ...)                     ; 0 or 1 matches
     211  (* <sre> ...)                     ; 0 or more matches
     212  (+ <sre> ...)                     ; 1 or more matches
     213  (= <n> <sre> ...)                 ; exactly <n> matches
     214  (>= <n> <sre> ...)                ; <n> or more matches
     215  (** <from> <to> <sre> ...)        ; <from> to <to> matches
     216  (?? <sre> ...)                    ; non-greedy 0 or 1 matches
     217  (*? <sre> ...)                    ; non-greedy kleene star
     218  (**? <from> <to> <sre> ...)       ; non-greedy range
     219 
     220  ;; submatch patterns
     221  (submatch <sre> ...)              ; numbered submatch
     222  (submatch-named <name> <sre> ...) ; named submatch
     223  (=> <name> <sre> ...)
     224  (backref <n-or-name>)             ; match a previous submatch
     225 
     226  ;; toggling case-sensitivity
     227  (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
     228  (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive
     229 
     230  ;; character sets
     231  <char>                            ; singleton char set
     232  (<string>)                        ; set of chars
     233  (or <cset-sre> ...)               ; set union
     234  (~ <cset-sre> ...)                ; set complement (i.e. [^...])
     235  (- <cset-sre> ...)                ; set difference
     236  (& <cset-sre> ...)                ; set intersection
     237  (/ <range-spec> ...)              ; pairs of chars as ranges
     238 
     239  ;; named character sets
     240  any
     241  nonl
     242  ascii
     243  lower-case     lower
     244  upper-case     upper
     245  alphabetic     alpha
     246  numeric        num
     247  alphanumeric   alphanum  alnum
     248  punctuation    punct
     249  graphic        graph
     250  whitespace     white     space
     251  printing       print
     252  control        cntrl
     253  hex-digit      xdigit
     254 
     255  ;; assertions and conditionals
     256  bos eos                           ; beginning/end of string
     257  bol eol                           ; beginning/end of line
     258  bow eow                           ; beginning/end of word
     259  nwb                               ; non-word-boundary
     260  (look-ahead <sre> ...)            ; zero-width look-ahead assertion
     261  (look-behind <sre> ...)           ; zero-width look-behind assertion
     262  (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
     263  (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
     264  (atomic <sre> ...)                ; for (?>...) independent patterns
     265  (if <test> <pass> [<fail>])       ; conditional patterns
     266  commit                            ; don't backtrack beyond this (i.e. cut)
     267 
     268  ;; backwards compatibility
     269  (posix-string <string>)           ; embed a POSIX string literal
     270
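The submatch forms listed in the table can be tried out directly. A rough sketch, assuming CHICKEN 4's {{(use regex)}} and assuming that {{string-search}} returns a list holding the whole match followed by each numbered submatch:

```scheme
(use regex)

;; Two numbered submatches separated by a literal dash.
(define parts
  (string-search '(: (submatch (+ numeric)) "-" (submatch (+ numeric)))
                 "tel: 555-1234"))

;; Under the assumption above, parts is ("555-1234" "555" "1234").
(print parts)
```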
     271====  Basic SRE Patterns
     272
     273The simplest SRE is a literal string, which matches that string exactly.
     274
     275  (string-search "needle" "hayneedlehay") => <match>
     276
     277By default the match is case-sensitive, though you can control this either with the compiler flags or local overrides:
     278
     279  (string-search "needle" "haynEEdlehay") => #f
     280 
     281  (string-search (irregex "needle" 'i) "haynEEdlehay") => <match>
     282 
     283  (string-search '(w/nocase "needle") "haynEEdlehay") => <match>
     284
     285You can use {{w/case}} to switch back to case-sensitivity inside a {{w/nocase}}:
     286
     287  (string-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => <match>
     288 
     289  (string-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f
     290
     291Of course, literal strings by themselves aren't very interesting
     292regular expressions, so we want to be able to compose them. The most
     293basic way to do this is with the {{seq}} operator (or its abbreviation {{:}}),
     294which matches one or more patterns consecutively:
     295
     296  (string-search '(: "one" space "two" space "three") "one two three") => <match>
     297
     298As you may have noticed above, the {{w/case}} and {{w/nocase}} operators
     299allowed multiple SREs in a sequence - other operators that take any
     300number of arguments (e.g. the repetition operators below) allow such
     301implicit sequences.
     302
     303To match any one of a set of patterns use the {{or}} alternation operator:
     304
     305  (string-search '(or "eeney" "meeney" "miney") "meeney") => <match>
     306
     307  (string-search '(or "eeney" "meeney" "miney") "moe") => #f
     308
     309====  SRE Repetition Patterns
     310
     311There are also several ways to control the number of times a pattern
     312is matched. The simplest of these is {{?}} which just optionally matches
     313the pattern:
     314
     315  (string-search '(: "match" (? "es") "!") "matches!") => <match>
     316 
     317  (string-search '(: "match" (? "es") "!") "match!") => <match>
     318 
     319  (string-search '(: "match" (? "es") "!") "matche!") => #f
     320
     321To optionally match any number of times, use {{*}}, the Kleene star:
     322
     323  (string-search '(: "<" (* (~ #\>)) ">") "<html>") => <match>
     324 
     325  (string-search '(: "<" (* (~ #\>)) ">") "<>") => <match>
     326 
     327  (string-search '(: "<" (* (~ #\>)) ">") "<html") => #f
     328
     329Often you want to match any number of times, but at least one time is required, and for that you use {{+}}:
     330
     331  (string-search '(: "<" (+ (~ #\>)) ">") "<html>") => <match>
     332 
     333  (string-search '(: "<" (+ (~ #\>)) ">") "<a>") => <match>
     334 
     335  (string-search '(: "<" (+ (~ #\>)) ">") "<>") => #f
     336
     337More generally, to match at least a given number of times, use {{>=}}:
     338
     339  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => <match>
     340
     341  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => <match>
     342
     343  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f
     344
     345To match a specific number of times exactly, use {{=}}:
     346
     347  (string-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => <match>
     348 
     349  (string-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f
     350
     351And finally, the most general form is {{**}} which specifies a range
     352of times to match. All of the earlier forms are special cases of this.
     353
     354  (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => <match>
     355
     356  (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f
     357
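To illustrate how the earlier forms reduce to {{**}}, here is a small sketch that rewrites {{?}} and {{=}} as explicit ranges. It assumes CHICKEN 4's {{(use regex)}}; the variable names are illustrative only:

```scheme
(use regex)

;; (? "es") written as a 0-to-1 range.
(define optional
  (string-search '(: "match" (** 0 1 "es") "!") "match!"))

;; (= 4 ...) written as a 4-to-4 range.
(define exact
  (string-search '(: "<" (** 4 4 (~ #\>)) ">") "<html>"))

;; Six characters between the brackets fall outside the 4-to-4 range,
;; so no match is expected here.
(define too-long
  (string-search '(: "<" (** 4 4 (~ #\>)) ">") "<table>"))
```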
     358There are also so-called "non-greedy" variants of these repetition
     359operators, by convention suffixed with an additional {{?}}. Since the
     360normal repetition patterns can match any count in the allotted
     361range, the non-greedy operators match a string if and only if the
     362normal versions match. However, once you take into account where each
     363submatch begins and ends (which is always the case with
     364{{string-search}}, since the endpoints of the match itself matter),
     365the use of a non-greedy repetition can change the result.
     366
     367So, whereas {{?}} can be thought to mean "match or don't match," {{??}} means
     368"don't match or match." {{*}} typically consumes as much as possible, but
     369{{*?}} tries first to match zero times, and only consumes one at a time if
     370that fails. If you have a greedy operator followed by a non-greedy
     371operator in the same pattern, they can produce surprising results as
     372they compete to make the match longer or shorter. If this seems
     373confusing, that's because it is. Non-greedy repetitions are defined
     374only in terms of the specific backtracking algorithm used to implement
     375them, which for compatibility purposes always means the Perl
     376algorithm. Thus, when using these patterns you force IrRegex to use a
     377backtracking engine, and can't rely on efficient execution.
     378
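The difference only shows up in what text is matched, not in whether a match exists. A minimal sketch, assuming CHICKEN 4's {{(use regex)}} and that the first element of the list returned by {{string-search}} is the complete match:

```scheme
(use regex)

;; Greedy star: consumes as much as possible before the closing ">".
(define greedy
  (string-search '(: "<" (* any) ">") "<a><b>"))

;; Non-greedy star: tries zero repetitions first, then one at a time.
(define non-greedy
  (string-search '(: "<" (*? any) ">") "<a><b>"))

(print (car greedy))      ; the whole input, "<a><b>"
(print (car non-greedy))  ; only the first tag, "<a>"
```

Both searches succeed on the same input; only the extent of the match differs.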
     379====  SRE Character Sets
     380
     381Perhaps more common than matching specific strings is matching any of
     382a set of characters. You can use the {{or}} alternation pattern on a list
     383of single-character strings to simulate a character set, but this is
     384too clumsy for everyday use so SRE syntax allows a number of
     385shortcuts.
     386
     387A single character matches that character literally, a trivial
     388character class. More conveniently, a list holding a single element
     389which is a string refers to the character set composed of every
     390character in the string.
     391
     392  (string-match '(* #\-) "---") => <match>
     393 
     394  (string-match '(* #\-) "-_-") => #f
     395 
     396  (string-match '(* ("aeiou")) "oui") => <match>
     397 
     398  (string-match '(* ("aeiou")) "ouais") => #f
     399
     400Ranges are introduced with the {{/}} operator. Any strings or characters
     401in the {{/}} are flattened and then taken in pairs to represent the start
     402and end points, inclusive, of character ranges.
     403
     404  (string-match '(* (/ "AZ09")) "R2D2") => <match>
     405 
     406  (string-match '(* (/ "AZ09")) "C-3PO") => #f
     407
     408In addition, a number of set algebra operations are provided. {{or}}, of
     409course, has the same meaning, but when all the options are character
     410sets it can be thought of as the set union operator. This is further
     411extended by the {{&}} set intersection, {{-}} set difference, and {{~}} set
     412complement operators.
     413
     414  (string-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => <match>
     415 
     416  (string-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f
     417
     418  (string-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => <match>
     419 
     420  (string-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f
     421
     422====  SRE Assertion Patterns
     423
     424There are a number of times it can be useful to assert something about
     425the area around a pattern without explicitly making it part of the
     426pattern. The most common cases are specifically anchoring some pattern
     427to the beginning or end of a word or line or even the whole
     428string. For example, to match on the end of a word:
     429
     430  (string-search '(: "foo" eow) "foo") => <match>
     431 
     432  (string-search '(: "foo" eow) "foo!") => <match>
     433 
     434  (string-search '(: "foo" eow) "foof") => #f
     435
     436The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} assertions work similarly. {{nwb}} asserts that you
     437are not on a word boundary - if substituted for {{eow}} in the above examples
     438it would reverse all the results.
     439
     440There is no {{wb}}, since you tend to know from context whether it
     441would be the beginning or end of a word, but if you need it you can
     442always use {{(or bow eow)}}.
     443
     444Somewhat more generally, Perl introduced positive and negative
     445look-ahead and look-behind patterns. Perl look-behind patterns are
     446limited to a fixed length, whereas the IrRegex versions have no such
     447limit.
     448
     449  (string-search '(: "regular" (look-ahead " expression")) "regular expression") => <match>
     450
     451The most general case, of course, would be an {{and}} pattern to
     452complement the {{or}} pattern - all the patterns must match or the whole
     453pattern fails. This may be provided in a future release, although it
     454(like the look-ahead and look-behind assertions) is unlikely to be
     455compiled efficiently.
     456
     457
    194458---
    195459Previous: [[Unit extras]]
  • chicken/trunk/scripts/makedist.scm

    r13816 r13851  
    4848        (warning "files missing" missing) ) )
    4949    (run (tar cfz ,(conc distname ".tar.gz") ,distname))
    50     (when full?
    51       (run (cp ,tgz site)) )
    5250    (run (rm -fr ,distname)) ) )
    5351