| 1 | [[tags: manual]] |
|---|
| 2 | |
|---|
| 3 | == Unit irregex |
|---|
| 4 | |
|---|
| 5 | This library unit provides support for irregular expressions. It |
|---|
| 6 | supports both POSIX syntax with various (irregular) PCRE extensions, |
|---|
| 7 | as well as SCSH's SRE syntax, with various aliases for commonly used |
|---|
| 8 | patterns. DFA matching is used when possible, otherwise a |
|---|
| 9 | closure-compiled NFA approach is used. Matching may be performed over |
|---|
| 10 | standard Scheme strings, or over arbitrarily chunked streams of |
|---|
| 11 | strings. |
|---|
| 12 | |
|---|
| 13 | On systems that support dynamic loading, the {{irregex}} unit can |
|---|
| 14 | be made available in the Chicken interpreter ({{csi}}) by entering |
|---|
| 15 | |
|---|
| 16 | <enscript highlight=scheme> |
|---|
| 17 | (require-extension irregex) |
|---|
| 18 | </enscript> |
|---|
| 19 | |
|---|
| 20 | [[toc:]] |
|---|
| 21 | |
|---|
| 22 | === Specification |
|---|
| 23 | |
|---|
| 24 | ==== Procedures |
|---|
| 25 | |
|---|
| 26 | ===== irregex |
|---|
| 27 | ===== string->irregex |
|---|
| 28 | ===== sre->irregex |
|---|
| 29 | |
|---|
| 30 | <procedure>(irregex <posix-string-or-sre> [<options> ...])</procedure><br> |
|---|
| 31 | <procedure>(string->irregex <posix-string> [<options> ...])</procedure><br> |
|---|
| 32 | <procedure>(sre->irregex <sre> [<options> ...])</procedure><br> |
|---|
| 33 | |
|---|
| 34 | Compiles a regular expression from either a POSIX-style regular |
|---|
| 35 | expression string (with most PCRE extensions) or an SCSH-style SRE. |
|---|
| 36 | There is no {{(rx ...)}} syntax - just use normal Scheme lists, with |
|---|
| 37 | {{quasiquote}} if you like. |
|---|
| 38 | |
|---|
| 39 | Technically a string by itself could be considered a valid (though |
|---|
| 40 | rather silly) SRE, so if you want to just match a literal string you |
|---|
| 41 | should use something like {{(irregex `(: ,str))}}, or use the explicit |
|---|
| 42 | {{(sre->irregex str)}}. |
|---|
| 43 | |
|---|
| 44 | The options are a list of any of the following symbols: |
|---|
| 45 | |
|---|
| 46 | ; {{'i}}, {{'case-insensitive}} : match case-insensitively |
|---|
| 47 | ; {{'m}}, {{'multi-line}} : treat string as multiple lines (effects {{^}} and {{$}}) |
|---|
| 48 | ; {{'s}}, {{'single-line}} : treat string as a single line ({{.}} can match newline) |
|---|
| 49 | ; {{'utf8}} : utf8-mode (assumes strings are byte-strings) |
|---|
| 50 | ; {{'fast}} : try to optimize the regular expression |
|---|
| 51 | ; {{'small}} : try to compile a smaller regular expression |
|---|
| 52 | ; {{'backtrack}} : enforce a backtracking implementation |
|---|
| 53 | |
|---|
| 54 | The {{'fast}} and {{'small}} options are heuristic guidelines and will |
|---|
| 55 | not necessarily make the compiled expression faster or smaller. |
|---|
| 56 | |
|---|
| 57 | ===== string->sre |
|---|
| 58 | ===== maybe-string->sre |
|---|
| 59 | |
|---|
| 60 | <procedure>(string->sre <str>)</procedure><br> |
|---|
| 61 | <procedure>(maybe-string->sre <obj>)</procedure><br> |
|---|
| 62 | |
|---|
| 63 | For backwards compatibility, procedures to convert a POSIX string into |
|---|
| 64 | an SRE. |
|---|
| 65 | |
|---|
| 66 | {{maybe-string->sre}} does the same thing, but only if the argument is |
|---|
| 67 | a string, otherwise it assumes {{<obj>}} is an SRE and returns it |
|---|
| 68 | as-is. This is useful when you want to provide an API that allows |
|---|
| 69 | either a POSIX string or SRE (like {{irregex}} or {{irregex-search}} |
|---|
| 70 | below) - it ensures the result is an SRE. |
|---|
| 71 | |
|---|
| 72 | ===== irregex? |
|---|
| 73 | |
|---|
| 74 | <procedure>(irregex? <obj>)</procedure><br> |
|---|
| 75 | |
|---|
| 76 | Returns {{#t}} iff the object is a regular expression. |
|---|
| 77 | |
|---|
| 78 | ===== irregex-search |
|---|
| 79 | |
|---|
| 80 | <procedure>(irregex-search <irx> <str> [<start> <end>])</procedure> |
|---|
| 81 | |
|---|
| 82 | Searches for any instances of the pattern {{<irx>}} (a POSIX string, SRE |
|---|
| 83 | sexp, or pre-compiled regular expression) in {{<str>}}, optionally between |
|---|
| 84 | the given range. If a match is found, returns a match object, |
|---|
| 85 | otherwise returns {{#f}}. |
|---|
| 86 | |
|---|
| 87 | Match objects can be used to query the original range of the string or |
|---|
| 88 | its submatches using the {{irregex-match-*}} procedures below. |
|---|
| 89 | |
|---|
| 90 | Examples: |
|---|
| 91 | |
|---|
| 92 | <enscript highlight=scheme> |
|---|
| 93 | (irregex-search "foobar" "abcFOOBARdef") => #f |
|---|
| 94 | |
|---|
| 95 | (irregex-search "foobar" "abcFOOBARdef" 'i) => #<match> |
|---|
| 96 | |
|---|
| 97 | (irregex-search '(w/nocase "foobar") "abcFOOBARdef") => #<match> |
|---|
| 98 | </enscript> |
|---|
| 99 | |
|---|
| 100 | Note, the actual match result is represented by a vector in the |
|---|
| 101 | default implementation. Throughout this manual, we'll just write |
|---|
| 102 | {{#<match>}} to show that a successful match was returned when the |
|---|
| 103 | details are not important. |
|---|
| 104 | |
|---|
| 105 | Matching follows the POSIX leftmost, longest semantics, when |
|---|
| 106 | searching. That is, of all possible matches in the string, |
|---|
| 107 | {{irregex-search}} will return the match at the first position |
|---|
| 108 | (leftmost). If multiple matches are possible from that same first |
|---|
| 109 | position, the longest match is returned. |
|---|
| 110 | |
|---|
| 111 | ===== irregex-match |
|---|
| 112 | |
|---|
| 113 | <procedure>(irregex-match <irx> <str>)</procedure> |
|---|
| 114 | |
|---|
| 115 | Like {{irregex-search}}, but performs an anchored match against the |
|---|
| 116 | beginning and end of the string, without searching. |
|---|
| 117 | |
|---|
| 118 | Examples: |
|---|
| 119 | |
|---|
| 120 | <enscript highlight=scheme> |
|---|
| 121 | (irregex-match '(w/nocase "foobar") "abcFOOBARdef") => #f |
|---|
| 122 | |
|---|
| 123 | (irregex-match '(w/nocase "foobar") "FOOBAR") => #<match> |
|---|
| 124 | </enscript> |
|---|
| 125 | |
|---|
| 126 | ===== irregex-match-data? |
|---|
| 127 | |
|---|
| 128 | <procedure>(irregex-match-data? <obj>)</procedure> |
|---|
| 129 | |
|---|
| 130 | Returns {{#t}} iff the object is a successful match result from |
|---|
| 131 | {{irregex-search}} or {{irregex-match}}. |
|---|
| 132 | |
|---|
| 133 | ===== irregex-num-submatches |
|---|
| 134 | ===== irregex-match-num-submatches |
|---|
| 135 | |
|---|
| 136 | <procedure>(irregex-num-submatches <irx>)</procedure><br> |
|---|
| 137 | <procedure>(irregex-match-num-submatches <match>)</procedure> |
|---|
| 138 | |
|---|
| 139 | Returns the number of numbered submatches that are defined in the |
|---|
| 140 | irregex or match object. |
|---|
| 141 | |
|---|
| 142 | ===== irregex-names |
|---|
| 143 | ===== irregex-match-names |
|---|
| 144 | |
|---|
| 145 | <procedure>(irregex-names <irx>)</procedure><br> |
|---|
| 146 | <procedure>(irregex-match-names <match>)</procedure> |
|---|
| 147 | |
|---|
| 148 | Returns an association list of named submatches that are defined in |
|---|
| 149 | the irregex or match object. The {{car}} of each item in this list is |
|---|
| 150 | the name of a submatch, the {{cdr}} of each item is the numerical |
|---|
| 151 | submatch corresponding to this name. If a named submatch occurs |
|---|
| 152 | multiple times in the irregex, it will also occur multiple times in |
|---|
| 153 | this list. |
|---|
| 154 | |
|---|
| 155 | ===== irregex-match-substring |
|---|
| 156 | ===== irregex-match-start-index |
|---|
| 157 | ===== irregex-match-end-index |
|---|
| 158 | |
|---|
| 159 | <procedure>(irregex-match-substring <match> [<index-or-name>])</procedure><br> |
|---|
| 160 | <procedure>(irregex-match-start-index <match> <index-or-name>)</procedure><br> |
|---|
| 161 | <procedure>(irregex-match-end-index <match> <index-or-name>)</procedure> |
|---|
| 162 | |
|---|
| 163 | Fetches the matched substring (or its start or end offset) at the |
|---|
| 164 | given submatch index, or named submatch. The entire match is index 0, |
|---|
| 165 | the first 1, etc. The default is index 0. |
|---|
| 166 | |
|---|
| 167 | ===== irregex-match-subchunk |
|---|
| 168 | |
|---|
| 169 | <procedure>(irregex-match-subchunk <match> [<index-or-name>])</procedure> |
|---|
| 170 | |
|---|
| 171 | Generates a chunked data-type for the given match item, of the same |
|---|
| 172 | type as the underlying chunk type (see Chunked String Matching below). |
|---|
| 173 | This is only available if the chunk type specifies the get-subchunk |
|---|
| 174 | API, otherwise an error is raised. |
|---|
| 175 | |
|---|
| 176 | ===== irregex-replace |
|---|
| 177 | ===== irregex-replace/all |
|---|
| 178 | |
|---|
| 179 | <procedure>(irregex-replace <irx> <str> [<replacements> ...])</procedure><br> |
|---|
| 180 | <procedure>(irregex-replace/all <irx> <str> [<replacements> ...])</procedure> |
|---|
| 181 | |
|---|
| 182 | Matches a pattern in a string, and replaces it with a (possibly empty) |
|---|
| 183 | list of substitutions. Each {{<replacement>}} can be either a string |
|---|
| 184 | literal, a numeric index, a symbol (as a named submatch), or a |
|---|
| 185 | procedure which takes one argument (the match object) and returns a |
|---|
| 186 | string. |
|---|
| 187 | |
|---|
| 188 | Examples: |
|---|
| 189 | |
|---|
| 190 | <enscript highlight=scheme> |
|---|
| 191 | (irregex-replace "[aeiou]" "hello world" "*") => "h*llo world" |
|---|
| 192 | |
|---|
| 193 | (irregex-replace/all "[aeiou]" "hello world" "*") => "h*ll* w*rld" |
|---|
| 194 | </enscript> |
|---|
| 195 | |
|---|
| 196 | ===== irregex-split |
|---|
| 197 | ===== irregex-extract |
|---|
| 198 | |
|---|
| 199 | <procedure>(irregex-split <irx> <str> [<start> <end>])</procedure><br> |
|---|
| 200 | <procedure>(irregex-extract <irx> <str> [<start> <end>])</procedure> |
|---|
| 201 | |
|---|
| 202 | {{irregex-split}} splits the string {{<str>}} into substrings divided |
|---|
| 203 | by the pattern in {{<irx>}}. {{irregex-extract}} does the opposite, |
|---|
| 204 | returning a list of each instance of the pattern matched disregarding |
|---|
| 205 | the substrings in between. |
|---|
| 206 | |
|---|
| 207 | ===== irregex-fold |
|---|
| 208 | |
|---|
| 209 | <procedure>(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])</procedure> |
|---|
| 210 | |
|---|
| 211 | This performs a fold operation over every non-overlapping place |
|---|
| 212 | {{<irx>}} occurs in the string {{str}}. |
|---|
| 213 | |
|---|
| 214 | The {{<kons>}} procedure takes the following signature: |
|---|
| 215 | |
|---|
| 216 | <enscript highlight=scheme> |
|---|
| 217 | (<kons> <from-index> <match> <seed>) |
|---|
| 218 | </enscript> |
|---|
| 219 | |
|---|
| 220 | where {{<from-index>}} is the index from where we started searching |
|---|
| 221 | (initially {{<start>}} and thereafter the end index of the last |
|---|
| 222 | match), {{<match>}} is the resulting match-data object, and {{<seed>}} |
|---|
| 223 | is the accumulated fold result starting with {{<knil>}}. |
|---|
| 224 | |
|---|
| 225 | The rationale for providing the {{<from-index>}} (which is not |
|---|
| 226 | provided in the SCSH {{regexp-fold}} utility), is because this |
|---|
| 227 | information is useful (e.g. for extracting the unmatched portion of |
|---|
| 228 | the string before the current match, as needed in |
|---|
| 229 | {{irregex-replace}}), and not otherwise directly accessible. |
|---|
| 230 | |
|---|
| 231 | The optional {{<finish>}} takes two arguments: |
|---|
| 232 | |
|---|
| 233 | <enscript highlight=scheme> |
|---|
| 234 | (<finish> <from-index> <seed>) |
|---|
| 235 | </enscript> |
|---|
| 236 | |
|---|
| 237 | which simiarly allows you to pick up the unmatched tail of the string, |
|---|
| 238 | and defaults to just returning the {{<seed>}}. |
|---|
| 239 | |
|---|
| 240 | {{<start>}} and {{<end>}} are numeric indices letting you specify the |
|---|
| 241 | boundaries of the string on which you want to fold. |
|---|
| 242 | |
|---|
| 243 | To extract all instances of a match out of a string, you can use |
|---|
| 244 | |
|---|
| 245 | <enscript highlight=scheme> |
|---|
| 246 | (map irregex-match-substring |
|---|
| 247 | (irregex-fold <irx> |
|---|
| 248 | (lambda (i m s) (cons m s)) |
|---|
| 249 | '() |
|---|
| 250 | <str> |
|---|
| 251 | (lambda (i s) (reverse s)))) |
|---|
| 252 | </enscript> |
|---|
| 253 | |
|---|
| 254 | ==== Extended SRE Syntax |
|---|
| 255 | |
|---|
| 256 | Irregex provides the first native implementation of SREs (Scheme |
|---|
| 257 | Regular Expressions), and includes many extensions necessary both for |
|---|
| 258 | minimal POSIX compatibility, as well as for modern extensions found in |
|---|
| 259 | libraries such as PCRE. |
|---|
| 260 | |
|---|
| 261 | The following table summarizes the SRE syntax, with detailed |
|---|
| 262 | explanations following. |
|---|
| 263 | |
|---|
| 264 | ;; basic patterns |
|---|
| 265 | <string> ; literal string |
|---|
| 266 | (seq <sre> ...) ; sequence |
|---|
| 267 | (: <sre> ...) |
|---|
| 268 | (or <sre> ...) ; alternation |
|---|
| 269 | |
|---|
| 270 | ;; optional/multiple patterns |
|---|
| 271 | (? <sre> ...) ; 0 or 1 matches |
|---|
| 272 | (* <sre> ...) ; 0 or more matches |
|---|
| 273 | (+ <sre> ...) ; 1 or more matches |
|---|
| 274 | (= <n> <sre> ...) ; exactly <n> matches |
|---|
| 275 | (>= <n> <sre> ...) ; <n> or more matches |
|---|
| 276 | (** <from> <to> <sre> ...) ; <n> to <m> matches |
|---|
| 277 | (?? <sre> ...) ; non-greedy (non-greedy) pattern: (0 or 1) |
|---|
| 278 | (*? <sre> ...) ; non-greedy kleene star |
|---|
| 279 | (**? <from> <to> <sre> ...) ; non-greedy range |
|---|
| 280 | |
|---|
| 281 | ;; submatch patterns |
|---|
| 282 | (submatch <sre> ...) ; numbered submatch |
|---|
| 283 | ($ <sre> ...) |
|---|
| 284 | (submatch-named <name> <sre> ...) ; named submatch |
|---|
| 285 | (=> <name> <sre> ...) |
|---|
| 286 | (backref <n-or-name>) ; match a previous submatch |
|---|
| 287 | |
|---|
| 288 | ;; toggling case-sensitivity |
|---|
| 289 | (w/case <sre> ...) ; enclosed <sre>s are case-sensitive |
|---|
| 290 | (w/nocase <sre> ...) ; enclosed <sre>s are case-insensitive |
|---|
| 291 | |
|---|
| 292 | ;; character sets |
|---|
| 293 | <char> ; singleton char set |
|---|
| 294 | (<string>) ; set of chars |
|---|
| 295 | (or <cset-sre> ...) ; set union |
|---|
| 296 | (~ <cset-sre> ...) ; set complement (i.e. [^...]) |
|---|
| 297 | (- <cset-sre> ...) ; set difference |
|---|
| 298 | (& <cset-sre> ...) ; set intersection |
|---|
| 299 | (/ <range-spec> ...) ; pairs of chars as ranges |
|---|
| 300 | |
|---|
| 301 | ;; named character sets |
|---|
| 302 | any |
|---|
| 303 | nonl |
|---|
| 304 | ascii |
|---|
| 305 | lower-case lower |
|---|
| 306 | upper-case upper |
|---|
| 307 | alphabetic alpha |
|---|
| 308 | numeric num |
|---|
| 309 | alphanumeric alphanum alnum |
|---|
| 310 | punctuation punct |
|---|
| 311 | graphic graph |
|---|
| 312 | whitespace white space |
|---|
| 313 | printing print |
|---|
| 314 | control cntrl |
|---|
| 315 | hex-digit xdigit |
|---|
| 316 | |
|---|
| 317 | ;; assertions and conditionals |
|---|
| 318 | bos eos ; beginning/end of string |
|---|
| 319 | bol eol ; beginning/end of line |
|---|
| 320 | bow eow ; beginning/end of word |
|---|
| 321 | nwb ; non-word-boundary |
|---|
| 322 | (look-ahead <sre> ...) ; zero-width look-ahead assertion |
|---|
| 323 | (look-behind <sre> ...) ; zero-width look-behind assertion |
|---|
| 324 | (neg-look-ahead <sre> ...) ; zero-width negative look-ahead assertion |
|---|
| 325 | (neg-look-behind <sre> ...) ; zero-width negative look-behind assertion |
|---|
| 326 | (atomic <sre> ...) ; for (?>...) independent patterns |
|---|
| 327 | (if <test> <pass> [<fail>]) ; conditional patterns |
|---|
| 328 | commit ; don't backtrack beyond this (i.e. cut) |
|---|
| 329 | |
|---|
| 330 | ;; backwards compatibility |
|---|
| 331 | (posix-string <string>) ; embed a POSIX string literal |
|---|
| 332 | |
|---|
| 333 | ===== Basic SRE Patterns |
|---|
| 334 | |
|---|
| 335 | The simplest SRE is a literal string, which matches that string |
|---|
| 336 | exactly. |
|---|
| 337 | |
|---|
| 338 | <enscript highlight=scheme> |
|---|
| 339 | (irregex-search "needle" "hayneedlehay") => #<match> |
|---|
| 340 | </enscipt> |
|---|
| 341 | |
|---|
| 342 | By default the match is case-sensitive, though you can control this |
|---|
| 343 | either with the compiler flags or local overrides: |
|---|
| 344 | |
|---|
| 345 | <enscript highlight=scheme> |
|---|
| 346 | (irregex-search "needle" "haynEEdlehay") => #f |
|---|
| 347 | |
|---|
| 348 | (irregex-search (irregex "needle" 'i) "haynEEdlehay") => #<match> |
|---|
| 349 | |
|---|
| 350 | (irregex-search '(w/nocase "needle") "haynEEdlehay") => #<match> |
|---|
| 351 | </enscript> |
|---|
| 352 | |
|---|
| 353 | You can use {{w/case}} to switch back to case-sensitivity inside a |
|---|
| 354 | {{w/nocase}} or when the SRE was compiled with {{'i}}: |
|---|
| 355 | |
|---|
| 356 | <enscript highlight=scheme> |
|---|
| 357 | (irregex-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<match> |
|---|
| 358 | |
|---|
| 359 | (irregex-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f |
|---|
| 360 | </enscript> |
|---|
| 361 | |
|---|
| 362 | Of course, literal strings by themselves aren't very interesting |
|---|
| 363 | regular expressions, so we want to be able to compose them. The most |
|---|
| 364 | basic way to do this is with the {{seq}} operator (or its abbreviation |
|---|
| 365 | {{:}}), which matches one or more patterns consecutively: |
|---|
| 366 | |
|---|
| 367 | <enscript highlight=scheme> |
|---|
| 368 | (irregex-search '(: "one" space "two" space "three") "one two three") => #<match> |
|---|
| 369 | </enscript> |
|---|
| 370 | |
|---|
| 371 | As you may have noticed above, the {{w/case}} and {{w/nocase}} |
|---|
| 372 | operators allowed multiple SREs in a sequence - other operators that |
|---|
| 373 | take any number of arguments (e.g. the repetition operators below) |
|---|
| 374 | allow such implicit sequences. |
|---|
| 375 | |
|---|
| 376 | To match any one of a set of patterns use the {{or}} alternation |
|---|
| 377 | operator: |
|---|
| 378 | |
|---|
| 379 | <enscript highlight=scheme> |
|---|
| 380 | (irregex-search '(or "eeney" "meeney" "miney") "meeney") => #<match> |
|---|
| 381 | |
|---|
| 382 | (irregex-search '(or "eeney" "meeney" "miney") "moe") => #f |
|---|
| 383 | </enscript> |
|---|
| 384 | |
|---|
| 385 | ===== SRE Repetition Patterns |
|---|
| 386 | |
|---|
| 387 | There are also several ways to control the number of times a pattern |
|---|
| 388 | is matched. The simplest of these is {{?}} which just optionally |
|---|
| 389 | matches the pattern: |
|---|
| 390 | |
|---|
| 391 | <enscript highlight=scheme> |
|---|
| 392 | (irregex-search '(: "match" (? "es") "!") "matches!") => #<match> |
|---|
| 393 | |
|---|
| 394 | (irregex-search '(: "match" (? "es") "!") "match!") => #<match> |
|---|
| 395 | |
|---|
| 396 | (irregex-search '(: "match" (? "es") "!") "matche!") => #<match> |
|---|
| 397 | </enscript> |
|---|
| 398 | |
|---|
| 399 | To optionally match any number of times, use {{*}}, the Kleene star: |
|---|
| 400 | |
|---|
| 401 | <enscript highlight=scheme> |
|---|
| 402 | (irregex-search '(: "<" (* (~ #\>)) ">") "<html>") => #<match> |
|---|
| 403 | |
|---|
| 404 | (irregex-search '(: "<" (* (~ #\>)) ">") "<>") => #<match> |
|---|
| 405 | |
|---|
| 406 | (irregex-search '(: "<" (* (~ #\>)) ">") "<html") => #f |
|---|
| 407 | </enscript> |
|---|
| 408 | |
|---|
| 409 | Often you want to match any number of times, but at least one time is |
|---|
| 410 | required, and for that you use {{+}}: |
|---|
| 411 | |
|---|
| 412 | <enscript highlight=scheme> |
|---|
| 413 | (irregex-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<match> |
|---|
| 414 | |
|---|
| 415 | (irregex-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<match> |
|---|
| 416 | |
|---|
| 417 | (irregex-search '(: "<" (+ (~ #\>)) ">") "<>") => #f |
|---|
| 418 | </enscript> |
|---|
| 419 | |
|---|
| 420 | More generally, to match at least a given number of times, use {{>=}}: |
|---|
| 421 | |
|---|
| 422 | <enscript highlight=scheme> |
|---|
| 423 | (irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<match> |
|---|
| 424 | |
|---|
| 425 | (irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<match> |
|---|
| 426 | |
|---|
| 427 | (irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f |
|---|
| 428 | </enscript> |
|---|
| 429 | |
|---|
| 430 | To match a specific number of times exactly, use {{=}}: |
|---|
| 431 | |
|---|
| 432 | <enscript highlight=scheme> |
|---|
| 433 | (irregex-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<match> |
|---|
| 434 | |
|---|
| 435 | (irregex-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f |
|---|
| 436 | </enscript> |
|---|
| 437 | |
|---|
| 438 | And finally, the most general form is {{**}} which specifies a range |
|---|
| 439 | of times to match. All of the earlier forms are special cases of this. |
|---|
| 440 | |
|---|
| 441 | <enscript highlight=scheme> |
|---|
| 442 | (irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<match> |
|---|
| 443 | |
|---|
| 444 | (irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f |
|---|
| 445 | </enscript> |
|---|
| 446 | |
|---|
| 447 | There are also so-called "non-greedy" variants of these repetition |
|---|
| 448 | operators, by convention suffixed with an additional {{?}}. Since the |
|---|
| 449 | normal repetition patterns can match any of the allotted repetition |
|---|
| 450 | range, these operators will match a string if and only if the normal |
|---|
| 451 | versions matched. However, when the endpoints of which submatch |
|---|
| 452 | matched where are taken into account (specifically, all matches when |
|---|
| 453 | using irregex-search since the endpoints of the match itself matter), |
|---|
| 454 | the use of a non-greedy repetition can change the result. |
|---|
| 455 | |
|---|
| 456 | So, whereas {{?}} can be thought to mean "match or don't match," |
|---|
| 457 | {{??}} means "don't match or match." {{*}} typically consumes as much |
|---|
| 458 | as possible, but {{*?}} tries first to match zero times, and only |
|---|
| 459 | consumes one at a time if that fails. If you have a greedy operator |
|---|
| 460 | followed by a non-greedy operator in the same pattern, they can |
|---|
| 461 | produce surprisins results as they compete to make the match longer or |
|---|
| 462 | shorter. If this seems confusing, that's because it is. Non-greedy |
|---|
| 463 | repetitions are defined only in terms of the specific backtracking |
|---|
| 464 | algorithm used to implement them, which for compatibility purposes |
|---|
| 465 | always means the Perl algorithm. Thus, when using these patterns you |
|---|
| 466 | force IrRegex to use a backtracking engine, and can't rely on |
|---|
| 467 | efficient execution. |
|---|
| 468 | |
|---|
| 469 | ===== SRE Character Sets |
|---|
| 470 | |
|---|
| 471 | Perhaps more common than matching specific strings is matching any of |
|---|
| 472 | a set of characters. You can use the {{or}} alternation pattern on a |
|---|
| 473 | list of single-character strings to simulate a character set, but this |
|---|
| 474 | is too clumsy for everyday use so SRE syntax allows a number of |
|---|
| 475 | shortcuts. |
|---|
| 476 | |
|---|
| 477 | A single character matches that character literally, a trivial |
|---|
| 478 | character class. More conveniently, a list holding a single element |
|---|
| 479 | which is a string refers to the character set composed of every |
|---|
| 480 | character in the string. |
|---|
| 481 | |
|---|
| 482 | <enscript highlight=scheme> |
|---|
| 483 | (irregex-match '(* #\-) "---") => #<match> |
|---|
| 484 | |
|---|
| 485 | (irregex-match '(* #\-) "-_-") => #f |
|---|
| 486 | |
|---|
| 487 | (irregex-match '(* ("aeiou")) "oui") => #<match> |
|---|
| 488 | |
|---|
| 489 | (irregex-match '(* ("aeiou")) "ouais") => #f |
|---|
| 490 | </enscript> |
|---|
| 491 | |
|---|
| 492 | Ranges are introduced with the \q{/} operator. Any strings or |
|---|
| 493 | characters in the \q{/} are flattened and then taken in pairs to |
|---|
| 494 | represent the start and end points, inclusive, of character ranges. |
|---|
| 495 | |
|---|
| 496 | <enscript highlight=scheme> |
|---|
| 497 | (irregex-match '(* (/ "AZ09")) "R2D2") => #<match> |
|---|
| 498 | |
|---|
| 499 | (irregex-match '(* (/ "AZ09")) "C-3PO") => #f |
|---|
| 500 | </enscript> |
|---|
| 501 | |
|---|
| 502 | In addition, a number of set algebra operations are provided. \q{or}, |
|---|
| 503 | of course, has the same meaning, but when all the options are |
|---|
| 504 | character sets it can be thought of as the set union operator. This |
|---|
| 505 | is further extended by the \q{&} set intersection, \q{-} set |
|---|
| 506 | difference, and \q{~} set complement operators. |
|---|
| 507 | |
|---|
| 508 | <enscript highlight=scheme> |
|---|
| 509 | (irregex-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<match> |
|---|
| 510 | |
|---|
| 511 | (irregex-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f |
|---|
| 512 | |
|---|
| 513 | (irregex-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<match> |
|---|
| 514 | |
|---|
| 515 | (irregex-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f |
|---|
| 516 | </enscript> |
|---|
| 517 | |
|---|
| 518 | ===== SRE Assertion Patterns |
|---|
| 519 | |
|---|
| 520 | There are a number of times it can be useful to assert something about |
|---|
| 521 | the area around a pattern without explicitly making it part of the |
|---|
| 522 | pattern. The most common cases are specifically anchoring some |
|---|
| 523 | pattern to the beginning or end of a word or line or even the whole |
|---|
| 524 | string. For example, to match on the end of a word: |
|---|
| 525 | |
|---|
| 526 | <enscript highlight=scheme> |
|---|
| 527 | (irregex-match '(: "foo" eow) "foo") => #<match> |
|---|
| 528 | |
|---|
| 529 | (irregex-match '(: "foo" eow) "foo!") => #<match> |
|---|
| 530 | |
|---|
| 531 | (irregex-match '(: "foo" eow) "foof") => #f |
|---|
| 532 | </enscript> |
|---|
| 533 | |
|---|
| 534 | The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} work similarly. |
|---|
| 535 | {{nwb}} asserts that you are not in a word-boundary - if replaced for |
|---|
| 536 | {{eow}} in the above examples it would reverse all the results. |
|---|
| 537 | |
|---|
| 538 | There is no {{wb}}, since you tend to know from context whether it |
|---|
| 539 | would be the beginning or end of a word, but if you need it you can |
|---|
| 540 | always use {{(or bow eow)}}. |
|---|
| 541 | |
|---|
| 542 | Somewhat more generally, Perl introduced positive and negative |
|---|
| 543 | look-ahead and look-behind patterns. Perl look-behind patterns are |
|---|
| 544 | limited to a fixed length, however the IrRegex versions have no such |
|---|
| 545 | limit. |
|---|
| 546 | |
|---|
| 547 | <enscript highlight=scheme> |
|---|
| 548 | (irregex-match '(: "regular" (look-ahead " expression")) |
|---|
| 549 | "regular expression") |
|---|
| 550 | => #<match> |
|---|
| 551 | </enscript> |
|---|
| 552 | |
|---|
| 553 | The most general case, of course, would be an \q{and} pattern to |
|---|
| 554 | complement the \q{or} pattern - all the patterns must match or the |
|---|
| 555 | whole pattern fails. This may be provided in a future release, |
|---|
| 556 | although it (and look-ahead and look-behind assertions) are unlikely |
|---|
| 557 | to be compiled efficiently. |
|---|
| 558 | |
|---|
| 559 | ===== SRE Utility Patterns |
|---|
| 560 | |
|---|
| 561 | The following utility regular expressions are also provided for common |
|---|
| 562 | patterns that people are eternally reinventing. They are not |
|---|
| 563 | necessarily the official patterns matching the RFC definitions of the |
|---|
| 564 | given data, because of the way that such patterns tend to be used. |
|---|
| 565 | There are three general usages for regexps: |
|---|
| 566 | |
|---|
| 567 | ; searching : search for a pattern matching a desired object in a larger text |
|---|
| 568 | |
|---|
| 569 | ; validation : determine whether an entire string matches a pattern |
|---|
| 570 | |
|---|
| 571 | ; extraction : given a string already known to be valid, extract certain fields from it as submatches |
|---|
| 572 | |
|---|
| 573 | In some cases, but not always, these will overlap. When they are |
|---|
| 574 | different, {{irregex-search}} will naturally always want the searching |
|---|
| 575 | version, so IrRegex provides that version. |
|---|
| 576 | |
|---|
| 577 | As an example where these might be different, consider a URL. If you |
|---|
| 578 | want to match all the URLs in some arbitrary text, you probably want |
|---|
| 579 | to exclude a period or comma at the tail end of a URL, since it's more |
|---|
| 580 | likely being used as punctuation rather than part of the URL, despite |
|---|
| 581 | the fact that it would be valid URL syntax. |
|---|
| 582 | |
|---|
| 583 | Another problem with the RFC definitions is the standard itself may |
|---|
| 584 | have become irrelevant. For example, the pattern IrRegex provides for |
|---|
| 585 | email addresses doesn't match quoted local parts (e.g. |
|---|
| 586 | {{"first last"@domain.com}}) because these are increasingly rare, and |
|---|
| 587 | unsupported by enough software that it's better to discourage their use. |
|---|
| 588 | Conversely, technically consecutive periods |
|---|
| 589 | (e.g. {{first..last@domain.com}}) are not allowed in email addresses, but |
|---|
| 590 | most email software does allow this, and in fact such addresses are |
|---|
| 591 | quite common in Japan. |
|---|
| 592 | |
|---|
| 593 | The current patterns provided are: |
|---|
| 594 | |
|---|
| 595 | newline ; general newline pattern (crlf, cr, lf) |
|---|
| 596 | integer ; an integer |
|---|
| 597 | real ; a real number (including scientific) |
|---|
| 598 | string ; a "quoted" string |
|---|
| 599 | symbol ; an R5RS Scheme symbol |
|---|
| 600 | ipv4-address ; a numeric decimal ipv4 address |
|---|
| 601 | ipv6-address ; a numeric hexadecimal ipv6 address |
|---|
| 602 | domain ; a domain name |
|---|
| 603 | email ; an email address |
|---|
| 604 | http-url ; a URL beginning with https?:// |
|---|
| 605 | |
|---|
| 606 | Because of these issues the exact definitions of these patterns are |
|---|
| 607 | subject to be changed, but will be documented clearly when they are |
|---|
| 608 | finalized. More common patterns are also planned, but as what you |
|---|
| 609 | want increases in complexity it's probably better to use a real |
|---|
| 610 | parser. |
|---|
| 611 | |
|---|
| 612 | ==== Supported PCRE Syntax |
|---|
| 613 | |
|---|
| 614 | Since the PCRE syntax is so overwhelming complex, it's easier to just |
|---|
| 615 | list what we *don't* support for now. Refer to the |
|---|
| 616 | [[http://pcre.org/pcre.txt|PCRE documentation]] for details. You |
|---|
| 617 | should be using the SRE syntax anyway! |
|---|
| 618 | |
|---|
| 619 | Unicode character classes ({{\P}}) are not supported, but will be |
|---|
| 620 | in an upcoming release. {{\C}} named characters are not supported. |
|---|
| 621 | |
|---|
| 622 | Callbacks, subroutine patterns and recursive patterns are not |
|---|
| 623 | supported. ({{*FOO}}) patterns are not supported and may never be. |
|---|
| 624 | |
|---|
| 625 | {{\G}} and {{\K}} are not supported. |
|---|
| 626 | |
|---|
| 627 | Octal character escapes are not supported because they are ambiguous |
|---|
| 628 | with back-references - just use hex character escapes. |
|---|
| 629 | |
|---|
| 630 | Other than that everything should work, including named submatches, |
|---|
| 631 | zero-width assertions, conditional patterns, etc. |
|---|
| 632 | |
|---|
| 633 | In addition, {{\<}} and {{\>}} act as beginning-of-word and end-of-word |
|---|
| 634 | marks, respectively, as in Emacs regular expressions. |
|---|
| 635 | |
|---|
| 636 | Also, two escapes are provided to embed SRE patterns inside PCRE |
|---|
| 637 | strings, {{"\'<sre>"}} and {{"(*'<sre>)"}}. For example, to match a |
|---|
| 638 | comma-delimited list of integers you could use |
|---|
| 639 | |
|---|
| 640 | <enscript highlight=scheme> |
|---|
| 641 | "\\'integer(,\\'integer)*" |
|---|
| 642 | </enscript> |
|---|
| 643 | |
|---|
| 644 | and to match a URL in angle brackets you could use |
|---|
| 645 | |
|---|
| 646 | <enscript highlight=scheme> |
|---|
| 647 | "<('*http-url)>" |
|---|
| 648 | </enscript> |
|---|
| 649 | |
|---|
| 650 | Note in the second example the enclosing {{"('*...)"}} syntax is needed |
|---|
| 651 | because the Scheme reader would consider the closing {{">"}} as part of |
|---|
| 652 | the SRE symbol. |
|---|
| 653 | |
|---|
| 654 | The following chart gives a quick reference from PCRE form to the SRE |
|---|
| 655 | equivalent: |
|---|
| 656 | |
|---|
| 657 | ;; basic syntax |
|---|
| 658 | "^" ;; bos (or eos inside (?m: ...)) |
|---|
| 659 | "$" ;; eos (or eos inside (?m: ...)) |
|---|
| 660 | "." ;; nonl |
|---|
| 661 | "a?" ;; (? a) |
|---|
| 662 | "a*" ;; (* a) |
|---|
| 663 | "a+" ;; (+ a) |
|---|
| 664 | "a??" ;; (?? a) |
|---|
| 665 | "a*?" ;; (*? a) |
|---|
| 666 | "a+?" ;; (+? a) |
|---|
| 667 | "a{n,m}" ;; (** n m a) |
|---|
| 668 | |
|---|
| 669 | ;; grouping |
|---|
| 670 | "(...)" ;; (submatch ...) |
|---|
| 671 | "(?:...)" ;; (: ...) |
|---|
| 672 | "(?i:...)" ;; (w/nocase ...) |
|---|
| 673 | "(?-i:...)" ;; (w/case ...) |
|---|
| 674 | "(?<name>...)" ;; (=> <name>...) |
|---|
| 675 | |
|---|
| 676 | ;; character classes |
|---|
| 677 | "[aeiou]" ;; ("aeiou") |
|---|
| 678 | "[^aeiou]" ;; (~ "aeiou") |
|---|
| 679 | "[a-z]" ;; (/ "az") or (/ "a" "z") |
|---|
| 680 | "[[:alpha:]]" ;; alpha |
|---|
| 681 | |
|---|
| 682 | ;; assertions |
|---|
| 683 | "(?=...)" ;; (look-ahead ...) |
|---|
| 684 | "(?!...)" ;; (neg-look-ahead ...) |
|---|
| 685 | "(?<=...)" ;; (look-behind ...) |
|---|
| 686 | "(?<!...)" ;; (neg-look-behind ...) |
|---|
| 687 | "(?(test)pass|fail)" ;; (if test pass fail) |
|---|
| 688 | "(*COMMIT)" ;; commit |
|---|
| 689 | |
|---|
| 690 | ==== Chunked String Matching |
|---|
| 691 | |
|---|
| 692 | It's often desirable to perform regular expression matching over |
|---|
| 693 | sequences of characters not represented as a single string. The most |
|---|
| 694 | obvious example is a text-buffer data structure, but you may also want |
|---|
| 695 | to match over lists or trees of strings (i.e. ropes), over only |
|---|
| 696 | certain ranges within a string, over an input port, etc. With |
|---|
| 697 | existing regular expression libraries, the only way to accomplish this |
|---|
| 698 | is by converting the abstract sequence into a freshly allocated |
|---|
| 699 | string. This can be expensive, or even impossible if the object is a |
|---|
| 700 | text-buffer opened onto a 500MB file. |
|---|
| 701 | |
|---|
| 702 | IrRegex provides a chunked string API specifically for this purpose. |
|---|
| 703 | You define a chunking API with {{make-irregex-chunker}}: |
|---|
| 704 | |
|---|
| 705 | ===== make-irregex-chunker |
|---|
| 706 | |
|---|
| 707 | <procedure>(make-irregex-chunker <get-next> <get-string> [<get-start> <get-end> <get-substring> <get-subchunk>])</procedure> |
|---|
| 708 | |
|---|
| 709 | where |
|---|
| 710 | |
|---|
| 711 | {{(<get-next> chunk) => }} returns the next chunk, or {{#f}} if there are no more chunks |
|---|
| 712 | |
|---|
| 713 | {{(<get-string> chunk) => }} a string source for the chunk |
|---|
| 714 | |
|---|
| 715 | {{(<get-start> chunk) => }} the start index of the result of {{<get-string>}} (defaults to always 0) |
|---|
| 716 | |
|---|
| 717 | {{(<get-end> chunk) => }} the end (exclusive) of the string (defaults to {{string-length}} of the source string) |
|---|
| 718 | |
|---|
| 719 | {{(<get-substring> cnk1 i cnk2 j) => }} a substring for the range between the chunk {{cnk1}} starting at index {{i}} and ending at {{cnk2}} at index {{j}} |
|---|
| 720 | |
|---|
| 721 | {{(<get-subchunk> cnk1 i cnk2 j) => }} as above but returns a new chunked data type instead of a string (optional) |
|---|
| 722 | |
|---|
| 723 | There are two important constraints on the {{<get-next>}} procedure. |
|---|
| 724 | It must return an {{eq?}} identical object when called multiple times |
|---|
| 725 | on the same chunk, and it must not return a chunk with an empty string |
|---|
| 726 | (start == end). This second constraint is for performance reasons - |
|---|
| 727 | we push the work of possibly filtering empty chunks to the chunker |
|---|
| 728 | since there are many chunk types for which empty strings aren't |
|---|
| 729 | possible, and this work is thus not needed. Note that the initial |
|---|
| 730 | chunk passed to match on is allowed to be empty. |
|---|
| 731 | |
|---|
| 732 | {{<get-substring>}} is provided for possible performance improvements |
|---|
| 733 | - without it a default is used. {{<get-subchunk>}} is optional - |
|---|
| 734 | without it you may not use {{irregex-match-subchunk}} described above. |
|---|
| 735 | |
|---|
| 736 | You can then match chunks of these types with the following |
|---|
| 737 | procedures: |
|---|
| 738 | |
|---|
| 739 | ===== irregex-search/chunked |
|---|
| 740 | ===== irregex-match/chunked |
|---|
| 741 | |
|---|
| 742 | <procedure>(irregex-search/chunked <irx> <chunker> <chunk> [<start>])</procedure><br> |
|---|
| 743 | <procedure>(irregex-match/chunked <irx> <chunker> <chunk> [<start>])</procedure> |
|---|
| 744 | |
|---|
| 745 | These return normal match-data objects. |
|---|
| 746 | |
|---|
| 747 | Example: |
|---|
| 748 | |
|---|
| 749 | To match against a simple, flat list of strings use: |
|---|
| 750 | |
|---|
| 751 | <enscript highlight=scheme> |
|---|
| 752 | (define (rope->string rope1 start rope2 end) |
|---|
| 753 | (if (eq? rope1 rope2) |
|---|
| 754 | (substring (car rope1) start end) |
|---|
| 755 | (let loop ((rope (cdr rope1)) |
|---|
| 756 | (res (list (substring (car rope1) start)))) |
|---|
| 757 | (if (eq? rope rope2) |
|---|
| 758 | (string-concatenate-reverse ; from SRFI-13 |
|---|
| 759 | (cons (substring (car rope) 0 end) res)) |
|---|
| 760 | (loop (cdr rope) (cons (car rope) res)))))) |
|---|
| 761 | |
|---|
| 762 | (define rope-chunker |
|---|
| 763 | (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) |
|---|
| 764 | car |
|---|
| 765 | (lambda (x) 0) |
|---|
| 766 | (lambda (x) (string-length (car x))) |
|---|
| 767 | rope->string)) |
|---|
| 768 | |
|---|
| 769 | (irregex-search/chunked <pat> rope-chunker <list-of-strings>) |
|---|
| 770 | </enscript> |
|---|
| 771 | |
|---|
| 772 | Here we are just using the default start, end and substring behaviors, |
|---|
| 773 | so the above chunker could simply be defined as: |
|---|
| 774 | |
|---|
| 775 | <enscript highlight=scheme> |
|---|
| 776 | (define rope-chunker |
|---|
| 777 | (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car)) |
|---|
| 778 | </enscript> |
|---|
| 779 | |
|---|
| 780 | ===== irregex-fold/chunked |
|---|
| 781 | |
|---|
| 782 | <procedure>(irregex-fold/chunked <irx> <kons> <knil> <chunker> <chunk> [<finish> [<start-index>]])</procedure> |
|---|
| 783 | |
|---|
| 784 | Chunked version of {{irregex-fold}}. |
|---|
| 785 | |
|---|
| 786 | ==== Utilities |
|---|
| 787 | |
|---|
| 788 | The following procedures are also available. |
|---|
| 789 | |
|---|
| 790 | ===== irregex-quote |
|---|
| 791 | |
|---|
| 792 | <procedure>(irregex-quote <str>)</procedure> |
|---|
| 793 | |
|---|
| 794 | Returns a new string with any special regular expression characters |
|---|
| 795 | escaped, to match the original string literally in POSIX regular |
|---|
| 796 | expressions. |
|---|
| 797 | |
|---|
| 798 | ===== irregex-opt |
|---|
| 799 | |
|---|
| 800 | <procedure>(irregex-opt <list-of-strings>)</procedure> |
|---|
| 801 | |
|---|
| 802 | Returns an optimized SRE matching any of the literal strings |
|---|
| 803 | in the list, like Emacs' \q{regexp-opt}. Note this optimization |
|---|
| 804 | doesn't help when irregex is able to build a DFA. |
|---|
| 805 | |
|---|
| 806 | ===== sre->string |
|---|
| 807 | |
|---|
| 808 | <procedure>(sre->string <sre>)</procedure> |
|---|
| 809 | |
|---|
| 810 | Convert an SRE to a POSIX-style regular expression string, if |
|---|
| 811 | possible. |
|---|
| 812 | |
|---|
| 813 | === License |
|---|
| 814 | |
|---|
| 815 | Copyright (c) 2005-2010 Alex Shinn |
|---|
| 816 | All rights reserved. |
|---|
| 817 | |
|---|
| 818 | Redistribution and use in source and binary forms, with or without |
|---|
| 819 | modification, are permitted provided that the following conditions |
|---|
| 820 | are met: |
|---|
| 821 | |
|---|
| 822 | 1. Redistributions of source code must retain the above copyright |
|---|
| 823 | notice, this list of conditions and the following disclaimer. |
|---|
| 824 | 2. Redistributions in binary form must reproduce the above copyright |
|---|
| 825 | notice, this list of conditions and the following disclaimer in the |
|---|
| 826 | documentation and/or other materials provided with the distribution. |
|---|
| 827 | 3. The name of the author may not be used to endorse or promote products |
|---|
| 828 | derived from this software without specific prior written permission. |
|---|
| 829 | |
|---|
| 830 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR |
|---|
| 831 | IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES |
|---|
| 832 | OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. |
|---|
| 833 | IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, |
|---|
| 834 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT |
|---|
| 835 | NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, |
|---|
| 836 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY |
|---|
| 837 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT |
|---|
| 838 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF |
|---|
| 839 | THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
|---|
| 840 | |
|---|
| 841 | === References |
|---|
| 842 | |
|---|
| 843 | * R. Kelsey, W. Clinger, J. Rees (eds.): ''[[http://www.schemers.org/Documents/Standards/R5RS/|Revised^5 Report on the Algorithmic Language Scheme]]'' |
|---|
| 844 | |
|---|
| 845 | * Russ Cox: ''[[http://swtch.com/~rsc/regexp/|Implementing Regular Expressions]]'' |
|---|
| 846 | |
|---|
| 847 | * Russ Cox: ''[[http://compilers.iecc.com/comparch/article/07-10-026|Henry Spencer's Tcl Regex Library]]'' |
|---|
| 848 | |
|---|
| 849 | * Olin Shivers: ''[[http://www.scsh.net/docu/post/sre.html|Proposed SRE regular-expression notation]]'' |
|---|
| 850 | |
|---|
| 851 | * Olin Shivers: ''[[http://www.scsh.net/docu/html/man-Z-H-7.html|Pattern-matching strings with regular expressions]]'' |
|---|
| 852 | |
|---|
| 853 | * Shiro Kawai: ''[[http://practical-scheme.net/gauche/man/gauche-refe_49.html|Gauche Scheme - Regular Expressions]]'' |
|---|
| 854 | |
|---|
| 855 | * Damian Conway: ''[[http://www.perl.com/pub/a/2002/08/22/exegesis5.html|Perl6 Exegesis 5 - Regular Expressions]]'' |
|---|
| 856 | |
|---|
| 857 | * Philip Hazel: ''[[http://www.pcre.org/|PCRE - Perl Compatible Regular Expressions]]'' |
|---|