[[tags: manual]]

== Unit irregex

This library unit provides support for irregular expressions. It
supports POSIX syntax with various (irregular) PCRE extensions,
as well as SCSH's SRE syntax, with various aliases for commonly used
patterns. DFA matching is used when possible, otherwise a
closure-compiled NFA approach is used. Matching may be performed over
standard Scheme strings, or over arbitrarily chunked streams of
strings.

On systems that support dynamic loading, the {{irregex}} unit can
be made available in the Chicken interpreter ({{csi}}) by entering

<enscript highlight=scheme>
(require-extension irregex)
</enscript>

[[toc:]]

=== Specification

==== Procedures

===== irregex
===== string->irregex
===== sre->irregex

<procedure>(irregex <posix-string-or-sre> [<options> ...])</procedure><br>
<procedure>(string->irregex <posix-string> [<options> ...])</procedure><br>
<procedure>(sre->irregex <sre> [<options> ...])</procedure><br>

Compiles a regular expression from either a POSIX-style regular
expression string (with most PCRE extensions) or an SCSH-style SRE.
There is no {{(rx ...)}} syntax - just use normal Scheme lists, with
{{quasiquote}} if you like.

Technically a string by itself could be considered a valid (though
rather silly) SRE, so if you want to just match a literal string you
should use something like {{(irregex `(: ,str))}}, or use the explicit
{{(sre->irregex str)}}.

The options are a list of any of the following symbols:

; {{'i}}, {{'case-insensitive}} : match case-insensitively
; {{'m}}, {{'multi-line}} : treat string as multiple lines (affects {{^}} and {{$}})
; {{'s}}, {{'single-line}} : treat string as a single line ({{.}} can match newline)
; {{'utf8}} : utf8-mode (assumes strings are byte-strings)
; {{'fast}} : try to optimize the regular expression
; {{'small}} : try to compile a smaller regular expression
; {{'backtrack}} : enforce a backtracking implementation

The {{'fast}} and {{'small}} options are heuristic guidelines and will
not necessarily make the compiled expression faster or smaller.

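As a rough sketch of how the three constructors differ (the names
{{dotted-posix}}, {{dotted-literal}} and {{needle}} are made up for
this example), the same string is a pattern when read as a POSIX
string but a literal when read as an SRE:

<enscript highlight=scheme>
(require-extension irregex)

;; Read as a POSIX string, "." is a wildcard.
(define dotted-posix (irregex "a.b"))

;; Read as an SRE, the same string is matched literally.
(define dotted-literal (sre->irregex "a.b"))

;; Options are passed as trailing symbols.
(define needle (irregex "needle" 'case-insensitive))

(irregex-search dotted-posix "axb")     ; a match object
(irregex-search dotted-literal "axb")   ; #f
(irregex-search dotted-literal "a.b")   ; a match object
(irregex-search needle "haynEEdlehay")  ; a match object
</enscript>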
===== string->sre
===== maybe-string->sre

<procedure>(string->sre <str>)</procedure><br>
<procedure>(maybe-string->sre <obj>)</procedure><br>

These procedures convert a POSIX string into an SRE, and are provided
for backwards compatibility.

{{maybe-string->sre}} does the same thing as {{string->sre}}, but only
if the argument is a string; otherwise it assumes {{<obj>}} is an SRE
and returns it as-is. This is useful when you want to provide an API
that allows either a POSIX string or SRE (like {{irregex}} or
{{irregex-search}} below) - it ensures the result is an SRE.

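For instance, a helper that accepts either kind of pattern might
normalize its argument first. The following is only a sketch;
{{whole-word}} is a hypothetical helper, not part of the unit:

<enscript highlight=scheme>
(require-extension irregex)

;; Hypothetical helper: wrap any pattern (POSIX string or SRE) so that
;; it only matches as a whole word.
(define (whole-word pattern)
  `(: bow ,(maybe-string->sre pattern) eow))

(irregex-search (whole-word "ca[rt]") "a cat sat")           ; a match object
(irregex-search (whole-word '(or "car" "cat")) "a cat sat")  ; a match object
(irregex-search (whole-word "ca[rt]") "concatenate")         ; #f
</enscript>

Because the result of {{maybe-string->sre}} is always an SRE, it can
be spliced into a larger SRE with {{quasiquote}}.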
===== irregex?

<procedure>(irregex? <obj>)</procedure><br>

Returns {{#t}} iff the object is a regular expression.

===== irregex-search

<procedure>(irregex-search <irx> <str> [<start> <end>])</procedure>

Searches for an instance of the pattern {{<irx>}} (a POSIX string, SRE
sexp, or pre-compiled regular expression) in {{<str>}}, optionally
restricted to the given range. If a match is found, returns a match
object, otherwise returns {{#f}}.

Match objects can be used to query the original range of the string or
its submatches using the {{irregex-match-*}} procedures below.

Examples:

<enscript highlight=scheme>
(irregex-search "foobar" "abcFOOBARdef") => #f

(irregex-search (irregex "foobar" 'i) "abcFOOBARdef") => #<match>

(irregex-search '(w/nocase "foobar") "abcFOOBARdef") => #<match>
</enscript>

Note, the actual match result is represented by a vector in the
default implementation. Throughout this manual, we'll just write
{{#<match>}} to show that a successful match was returned when the
details are not important.

Matching follows the POSIX leftmost, longest semantics when
searching. That is, of all possible matches in the string,
{{irregex-search}} will return the match at the first position
(leftmost). If multiple matches are possible from that same first
position, the longest match is returned.

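The optional range arguments and the leftmost rule can be seen in a
small sketch (the variable {{text}} is just an example string):

<enscript highlight=scheme>
(require-extension irregex)

(define text "ab 12 cd 345")

;; Leftmost match: the first run of digits.
(irregex-match-substring (irregex-search '(+ numeric) text))    ; "12"

;; Start the search at index 5, which skips the first run.
(irregex-match-substring (irregex-search '(+ numeric) text 5))  ; "345"

;; Restrict the search to indices 0 up to 3: no digits there.
(irregex-search '(+ numeric) text 0 3)                          ; #f
</enscript>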
===== irregex-match

<procedure>(irregex-match <irx> <str>)</procedure>

Like {{irregex-search}}, but performs an anchored match against the
beginning and end of the string, without searching.

Examples:

<enscript highlight=scheme>
(irregex-match '(w/nocase "foobar") "abcFOOBARdef") => #f

(irregex-match '(w/nocase "foobar") "FOOBAR") => #<match>
</enscript>

===== irregex-match-data?

<procedure>(irregex-match-data? <obj>)</procedure>

Returns {{#t}} iff the object is a successful match result from
{{irregex-search}} or {{irregex-match}}.

===== irregex-num-submatches
===== irregex-match-num-submatches

<procedure>(irregex-num-submatches <irx>)</procedure><br>
<procedure>(irregex-match-num-submatches <match>)</procedure>

Returns the number of numbered submatches that are defined in the
irregex or match object.

===== irregex-names
===== irregex-match-names

<procedure>(irregex-names <irx>)</procedure><br>
<procedure>(irregex-match-names <match>)</procedure>

Returns an association list of named submatches that are defined in
the irregex or match object. The {{car}} of each item in this list is
the name of a submatch, the {{cdr}} of each item is the numerical
submatch corresponding to this name. If a named submatch occurs
multiple times in the irregex, it will also occur multiple times in
this list.

===== irregex-match-substring
===== irregex-match-start-index
===== irregex-match-end-index

<procedure>(irregex-match-substring <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-start-index <match> <index-or-name>)</procedure><br>
<procedure>(irregex-match-end-index <match> <index-or-name>)</procedure>

Fetches the matched substring (or its start or end offset) at the
given submatch index, or named submatch. The entire match is index 0,
the first submatch is index 1, etc. For {{irregex-match-substring}}
the default is index 0.

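A short sketch of querying a match object with both numbered and
named submatches (the pattern and the name {{year}} are chosen only
for illustration):

<enscript highlight=scheme>
(require-extension irregex)

;; Submatch 1 is named `year'; submatch 2 is the two-digit month.
(define m
  (irregex-search '(: (=> year (= 4 numeric)) "-" ($ (= 2 numeric)))
                  "on 2024-05-17"))

(irregex-match-data? m)            ; #t
(irregex-match-num-submatches m)   ; 2
(irregex-match-names m)            ; an alist such as ((year . 1))

(irregex-match-substring m)        ; "2024-05" (index 0, the whole match)
(irregex-match-substring m 'year)  ; "2024"
(irregex-match-substring m 2)      ; "05"
(irregex-match-start-index m 0)    ; 3
(irregex-match-end-index m 0)      ; 10
</enscript>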
===== irregex-match-subchunk

<procedure>(irregex-match-subchunk <match> [<index-or-name>])</procedure>

Generates a chunked data-type for the given match item, of the same
type as the underlying chunk type (see Chunked String Matching below).
This is only available if the chunk type specifies the get-subchunk
API, otherwise an error is raised.

===== irregex-replace
===== irregex-replace/all

<procedure>(irregex-replace <irx> <str> [<replacements> ...])</procedure><br>
<procedure>(irregex-replace/all <irx> <str> [<replacements> ...])</procedure>

Matches a pattern in a string, and replaces it with a (possibly empty)
list of substitutions. Each {{<replacement>}} can be either a string
literal, a numeric index, a symbol (as a named submatch), or a
procedure which takes one argument (the match object) and returns a
string.

Examples:

<enscript highlight=scheme>
(irregex-replace "[aeiou]" "hello world" "*") => "h*llo world"

(irregex-replace/all "[aeiou]" "hello world" "*") => "h*ll* w*rld"
</enscript>

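The other replacement kinds can be sketched as follows (the
patterns and input strings are only illustrative):

<enscript highlight=scheme>
(require-extension irregex)

;; Numeric indices refer to submatches: swap the two numbers.
(irregex-replace/all '(: ($ (+ numeric)) "-" ($ (+ numeric)))
                     "1-2 3-4"
                     2 "-" 1)
;; => "2-1 4-3"

;; A procedure receives the match object and returns a string.
(irregex-replace '(+ alphabetic)
                 "hello world"
                 (lambda (m)
                   (string-append "<" (irregex-match-substring m) ">")))
;; => "<hello> world"

;; An empty replacement list simply deletes every match.
(irregex-replace/all '(+ space) "a b  c")
;; => "abc"
</enscript>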
===== irregex-split
===== irregex-extract

<procedure>(irregex-split <irx> <str> [<start> <end>])</procedure><br>
<procedure>(irregex-extract <irx> <str> [<start> <end>])</procedure>

{{irregex-split}} splits the string {{<str>}} into substrings divided
by the pattern in {{<irx>}}. {{irregex-extract}} does the opposite,
returning a list of each instance of the pattern matched disregarding
the substrings in between.

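A minimal sketch of the two procedures side by side, using made-up
input strings:

<enscript highlight=scheme>
(require-extension irregex)

;; Split on runs of whitespace: keep what lies between the matches.
(irregex-split '(+ space) "one two  three")    ; ("one" "two" "three")

;; Extract runs of digits: keep the matches themselves.
(irregex-extract '(+ numeric) "tel 555 1234")  ; ("555" "1234")
</enscript>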
===== irregex-fold

<procedure>(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])</procedure>

This performs a fold operation over every non-overlapping place
{{<irx>}} occurs in the string {{<str>}}.

The {{<kons>}} procedure takes the following signature:

<enscript highlight=scheme>
(<kons> <from-index> <match> <seed>)
</enscript>

where {{<from-index>}} is the index from where we started searching
(initially {{<start>}} and thereafter the end index of the last
match), {{<match>}} is the resulting match-data object, and {{<seed>}}
is the accumulated fold result starting with {{<knil>}}.

The {{<from-index>}} argument (which is not provided by the SCSH
{{regexp-fold}} utility) is passed because this information is useful
(e.g. for extracting the unmatched portion of the string before the
current match, as needed in {{irregex-replace}}), and is not otherwise
directly accessible.

The optional {{<finish>}} takes two arguments:

<enscript highlight=scheme>
(<finish> <from-index> <seed>)
</enscript>

which similarly allows you to pick up the unmatched tail of the string,
and defaults to just returning the {{<seed>}}.

{{<start>}} and {{<end>}} are numeric indices letting you specify the
boundaries of the string on which you want to fold.

To extract all instances of a match out of a string, you can use

<enscript highlight=scheme>
(map irregex-match-substring
     (irregex-fold <irx>
                   (lambda (i m s) (cons m s))
                   '()
                   <str>
                   (lambda (i s) (reverse s))))
</enscript>

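To show what {{<from-index>}} and {{<finish>}} are good for, here is
a sketch that collects the unmatched pieces of a string instead of
the matches. {{unmatched-pieces}} is a hypothetical helper written
for this example, and it assumes the indices behave as described
above:

<enscript highlight=scheme>
(require-extension irregex)

;; Hypothetical helper: return the substrings of STR that lie between
;; (and around) the matches of IRX.
(define (unmatched-pieces irx str)
  (irregex-fold irx
                ;; i is where this search started; the text from i up
                ;; to the start of the match was not matched.
                (lambda (i m acc)
                  (cons (substring str i (irregex-match-start-index m 0))
                        acc))
                '()
                str
                ;; After the last match, i marks the unmatched tail.
                (lambda (i acc)
                  (reverse (cons (substring str i (string-length str))
                                 acc)))))

(unmatched-pieces '(+ numeric) "ab12cd345ef")  ; ("ab" "cd" "ef")
</enscript>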
==== Extended SRE Syntax

Irregex provides the first native implementation of SREs (Scheme
Regular Expressions), and includes many extensions necessary both for
minimal POSIX compatibility, as well as for modern extensions found in
libraries such as PCRE.

The following table summarizes the SRE syntax, with detailed
explanations following.

    ;; basic patterns
    <string>                          ; literal string
    (seq <sre> ...)                   ; sequence
    (: <sre> ...)
    (or <sre> ...)                    ; alternation

    ;; optional/multiple patterns
    (? <sre> ...)                     ; 0 or 1 matches
    (* <sre> ...)                     ; 0 or more matches
    (+ <sre> ...)                     ; 1 or more matches
    (= <n> <sre> ...)                 ; exactly <n> matches
    (>= <n> <sre> ...)                ; <n> or more matches
    (** <from> <to> <sre> ...)        ; <from> to <to> matches
    (?? <sre> ...)                    ; non-greedy pattern: 0 or 1 matches
    (*? <sre> ...)                    ; non-greedy kleene star
    (**? <from> <to> <sre> ...)       ; non-greedy range

    ;; submatch patterns
    (submatch <sre> ...)              ; numbered submatch
    ($ <sre> ...)
    (submatch-named <name> <sre> ...) ; named submatch
    (=> <name> <sre> ...)
    (backref <n-or-name>)             ; match a previous submatch

    ;; toggling case-sensitivity
    (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
    (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive

    ;; character sets
    <char>                            ; singleton char set
    (<string>)                        ; set of chars
    (or <cset-sre> ...)               ; set union
    (~ <cset-sre> ...)                ; set complement (i.e. [^...])
    (- <cset-sre> ...)                ; set difference
    (& <cset-sre> ...)                ; set intersection
    (/ <range-spec> ...)              ; pairs of chars as ranges

    ;; named character sets
    any
    nonl
    ascii
    lower-case     lower
    upper-case     upper
    alphabetic     alpha
    numeric        num
    alphanumeric   alphanum   alnum
    punctuation    punct
    graphic        graph
    whitespace     white      space
    printing       print
    control        cntrl
    hex-digit      xdigit

    ;; assertions and conditionals
    bos eos                           ; beginning/end of string
    bol eol                           ; beginning/end of line
    bow eow                           ; beginning/end of word
    nwb                               ; non-word-boundary
    (look-ahead <sre> ...)            ; zero-width look-ahead assertion
    (look-behind <sre> ...)           ; zero-width look-behind assertion
    (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
    (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
    (atomic <sre> ...)                ; for (?>...) independent patterns
    (if <test> <pass> [<fail>])       ; conditional patterns
    commit                            ; don't backtrack beyond this (i.e. cut)

    ;; backwards compatibility
    (posix-string <string>)           ; embed a POSIX string literal

===== Basic SRE Patterns

The simplest SRE is a literal string, which matches that string
exactly.

<enscript highlight=scheme>
(irregex-search "needle" "hayneedlehay") => #<match>
</enscript>

By default the match is case-sensitive, though you can control this
either with the compiler flags or local overrides:

<enscript highlight=scheme>
(irregex-search "needle" "haynEEdlehay") => #f

(irregex-search (irregex "needle" 'i) "haynEEdlehay") => #<match>

(irregex-search '(w/nocase "needle") "haynEEdlehay") => #<match>
</enscript>

You can use {{w/case}} to switch back to case-sensitivity inside a
{{w/nocase}} or when the SRE was compiled with {{'i}}:

<enscript highlight=scheme>
(irregex-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<match>

(irregex-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f
</enscript>

Of course, literal strings by themselves aren't very interesting
regular expressions, so we want to be able to compose them. The most
basic way to do this is with the {{seq}} operator (or its abbreviation
{{:}}), which matches one or more patterns consecutively:

<enscript highlight=scheme>
(irregex-search '(: "one" space "two" space "three") "one two three") => #<match>
</enscript>

As you may have noticed above, the {{w/case}} and {{w/nocase}}
operators allow multiple SREs in a sequence. Other operators that
take any number of arguments (e.g. the repetition operators below)
also allow such implicit sequences.

To match any one of a set of patterns use the {{or}} alternation
operator:

<enscript highlight=scheme>
(irregex-search '(or "eeney" "meeney" "miney") "meeney") => #<match>

(irregex-search '(or "eeney" "meeney" "miney") "moe") => #f
</enscript>

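Since SREs are ordinary Scheme lists, they can be built at run time
with {{quasiquote}}. A minimal sketch, where {{keyword-sre}} is a
hypothetical helper:

<enscript highlight=scheme>
(require-extension irregex)

;; Hypothetical helper: an SRE matching any of WORDS as a whole word.
(define (keyword-sre words)
  `(: bow (or ,@words) eow))

(irregex-match-substring
 (irregex-search (keyword-sre '("if" "else" "cond")) "x else y"))
;; => "else"

(irregex-search (keyword-sre '("if" "else")) "elsewhere")
;; => #f
</enscript>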
===== SRE Repetition Patterns

There are also several ways to control the number of times a pattern
is matched. The simplest of these is {{?}} which just optionally
matches the pattern:

<enscript highlight=scheme>
(irregex-search '(: "match" (? "es") "!") "matches!") => #<match>

(irregex-search '(: "match" (? "es") "!") "match!") => #<match>

(irregex-search '(: "match" (? "es") "!") "matche!") => #f
</enscript>

To optionally match any number of times, use {{*}}, the Kleene star:

<enscript highlight=scheme>
(irregex-search '(: "<" (* (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (* (~ #\>)) ">") "<>") => #<match>

(irregex-search '(: "<" (* (~ #\>)) ">") "<html") => #f
</enscript>

Often you want to match any number of times, but at least one time is
required, and for that you use {{+}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<match>

(irregex-search '(: "<" (+ (~ #\>)) ">") "<>") => #f
</enscript>

More generally, to match at least a given number of times, use {{>=}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<match>

(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<match>

(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f
</enscript>

To match a specific number of times exactly, use {{=}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f
</enscript>

And finally, the most general form is {{**}} which specifies a range
of times to match. All of the earlier forms are special cases of this.

<enscript highlight=scheme>
(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<match>

(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f
</enscript>

There are also so-called "non-greedy" variants of these repetition
operators, by convention suffixed with an additional {{?}}. Since the
normal repetition patterns can match any of the allotted repetition
range, these operators will match a string if and only if the normal
versions matched. However, when the endpoints of the match or of its
submatches are taken into account (which is always the case with
{{irregex-search}}, since the endpoints of the match itself matter),
the use of a non-greedy repetition can change the result.

So, whereas {{?}} can be thought to mean "match or don't match,"
{{??}} means "don't match or match." {{*}} typically consumes as much
as possible, but {{*?}} tries first to match zero times, and only
consumes one at a time if that fails. If you have a greedy operator
followed by a non-greedy operator in the same pattern, they can
produce surprising results as they compete to make the match longer or
shorter. If this seems confusing, that's because it is. Non-greedy
repetitions are defined only in terms of the specific backtracking
algorithm used to implement them, which for compatibility purposes
always means the Perl algorithm. Thus, when using these patterns you
force IrRegex to use a backtracking engine, and can't rely on
efficient execution.

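The difference only shows up in what was matched, not in whether a
match exists. A small sketch, assuming the usual Perl-style
behaviour described above:

<enscript highlight=scheme>
(require-extension irregex)

;; Greedy: (* any) runs to the last ">" it can reach.
(irregex-match-substring
 (irregex-search '(: "<" (* any) ">") "<a><b>"))
;; => "<a><b>"

;; Non-greedy: (*? any) stops at the first ">" that lets the match succeed.
(irregex-match-substring
 (irregex-search '(: "<" (*? any) ">") "<a><b>"))
;; => "<a>"
</enscript>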
===== SRE Character Sets

Perhaps more common than matching specific strings is matching any of
a set of characters. You can use the {{or}} alternation pattern on a
list of single-character strings to simulate a character set, but this
is too clumsy for everyday use so SRE syntax allows a number of
shortcuts.

A single character matches that character literally, a trivial
character class. More conveniently, a list holding a single element
which is a string refers to the character set composed of every
character in the string.

<enscript highlight=scheme>
(irregex-match '(* #\-) "---") => #<match>

(irregex-match '(* #\-) "-_-") => #f

(irregex-match '(* ("aeiou")) "oui") => #<match>

(irregex-match '(* ("aeiou")) "ouais") => #f
</enscript>

Ranges are introduced with the {{/}} operator. Any strings or
characters in the {{/}} are flattened and then taken in pairs to
represent the start and end points, inclusive, of character ranges.

<enscript highlight=scheme>
(irregex-match '(* (/ "AZ09")) "R2D2") => #<match>

(irregex-match '(* (/ "AZ09")) "C-3PO") => #f
</enscript>

In addition, a number of set algebra operations are provided. {{or}},
of course, has the same meaning, but when all the options are
character sets it can be thought of as the set union operator. This
is further extended by the {{&}} set intersection, {{-}} set
difference, and {{~}} set complement operators.

<enscript highlight=scheme>
(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<match>

(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f

(irregex-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<match>

(irregex-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f
</enscript>

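Putting ranges, unions and counted repetition together, a sketch of
two small validators (the names {{hex-color}} and {{identifier}} are
invented for this example):

<enscript highlight=scheme>
(require-extension irregex)

;; "#" followed by exactly six hex digits, given as three ranges.
(define hex-color '(: "#" (= 6 (/ "09afAF"))))

(irregex-match hex-color "#1a2B3c")   ; a match object
(irregex-match hex-color "#12345g")   ; #f

;; Union of named sets and a single character.
(define identifier '(+ (or alphabetic numeric #\_)))

(irregex-match identifier "foo_bar42")  ; a match object
(irregex-match identifier "foo-bar")    ; #f
</enscript>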
===== SRE Assertion Patterns

It is often useful to assert something about the area around a
pattern without explicitly making it part of the pattern. The most
common cases are specifically anchoring some pattern to the beginning
or end of a word or line or even the whole string. For example, to
match on the end of a word:

<enscript highlight=scheme>
(irregex-search '(: "foo" eow) "foo") => #<match>

(irregex-search '(: "foo" eow) "foo!") => #<match>

(irregex-search '(: "foo" eow) "foof") => #f
</enscript>

The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} assertions work
similarly. {{nwb}} asserts that you are not at a word boundary - if
substituted for {{eow}} in the above examples it would reverse all the
results.

There is no {{wb}}, since you tend to know from context whether it
would be the beginning or end of a word, but if you need it you can
always use {{(or bow eow)}}.

Somewhat more generally, Perl introduced positive and negative
look-ahead and look-behind patterns. Perl look-behind patterns are
limited to a fixed length; the IrRegex versions have no such limit.

<enscript highlight=scheme>
(irregex-search '(: "regular" (look-ahead " expression"))
                "regular expression")
 => #<match>
</enscript>

The most general case, of course, would be an {{and}} pattern to
complement the {{or}} pattern - all the patterns must match or the
whole pattern fails. This may be provided in a future release,
although it (like look-ahead and look-behind assertions) is unlikely
to be compiled efficiently.

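Because these assertions are zero-width, the text they inspect is
not part of the match. A brief sketch with illustrative inputs:

<enscript highlight=scheme>
(require-extension irregex)

;; Digits only count when followed by "px"; the "px" is not consumed.
(irregex-match-substring
 (irregex-search '(: (+ numeric) (look-ahead "px")) "10em 24px"))
;; => "24"

;; "foo", but not when it is followed by "bar".
(irregex-search '(: "foo" (neg-look-ahead "bar")) "foobar")  ; #f
(irregex-search '(: "foo" (neg-look-ahead "bar")) "foobaz")  ; a match object
</enscript>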
===== SRE Utility Patterns

The following utility regular expressions are also provided for common
patterns that people are eternally reinventing. They are not
necessarily the official patterns matching the RFC definitions of the
given data, because of the way that such patterns tend to be used.
There are three general usages for regexps:

; searching : search for a pattern matching a desired object in a larger text

; validation : determine whether an entire string matches a pattern

; extraction : given a string already known to be valid, extract certain fields from it as submatches

In some cases, but not always, these will overlap. When they are
different, {{irregex-search}} will naturally always want the searching
version, so IrRegex provides that version.

As an example where these might be different, consider a URL. If you
want to match all the URLs in some arbitrary text, you probably want
to exclude a period or comma at the tail end of a URL, since it's more
likely being used as punctuation rather than part of the URL, despite
the fact that it would be valid URL syntax.

Another problem with the RFC definitions is the standard itself may
have become irrelevant. For example, the pattern IrRegex provides for
email addresses doesn't match quoted local parts (e.g.
{{"first last"@domain.com}}) because these are increasingly rare, and
unsupported by enough software that it's better to discourage their use.
Conversely, technically consecutive periods
(e.g. {{first..last@domain.com}}) are not allowed in email addresses, but
most email software does allow this, and in fact such addresses are
quite common in Japan.

The current patterns provided are:

    newline        ; general newline pattern (crlf, cr, lf)
    integer        ; an integer
    real           ; a real number (including scientific)
    string         ; a "quoted" string
    symbol         ; an R5RS Scheme symbol
    ipv4-address   ; a numeric decimal ipv4 address
    ipv6-address   ; a numeric hexadecimal ipv6 address
    domain         ; a domain name
    email          ; an email address
    http-url       ; a URL beginning with https?://

Because of these issues the exact definitions of these patterns are
subject to change, but will be documented clearly when they are
finalized. More common patterns are also planned, but as what you
want increases in complexity it's probably better to use a real
parser.

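The utility patterns are used like any other named SRE. The sketch
below sticks to {{integer}} with plain unsigned input; since the
exact definitions may change, treat the results as what one would
expect rather than a guarantee:

<enscript highlight=scheme>
(require-extension irregex)

;; Validation: the whole string must be an integer.
(irregex-match 'integer "42")    ; a match object
(irregex-match 'integer "4x2")   ; #f

;; Searching/extraction: pull every integer out of a larger text.
(irregex-extract 'integer "10 apples, 20 pears")  ; ("10" "20")
</enscript>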
612 | ==== Supported PCRE Syntax |
---|
613 | |
---|
614 | Since the PCRE syntax is so overwhelming complex, it's easier to just |
---|
615 | list what we *don't* support for now. Refer to the |
---|
616 | [[http://pcre.org/pcre.txt|PCRE documentation]] for details. You |
---|
617 | should be using the SRE syntax anyway! |
---|
618 | |
---|
619 | Unicode character classes ({{\P}}) are not supported, but will be |
---|
620 | in an upcoming release. {{\C}} named characters are not supported. |
---|
621 | |
---|
622 | Callbacks, subroutine patterns and recursive patterns are not |
---|
623 | supported. ({{*FOO}}) patterns are not supported and may never be. |
---|
624 | |
---|
625 | {{\G}} and {{\K}} are not supported. |
---|
626 | |
---|
627 | Octal character escapes are not supported because they are ambiguous |
---|
628 | with back-references - just use hex character escapes. |
---|
629 | |
---|
630 | Other than that everything should work, including named submatches, |
---|
631 | zero-width assertions, conditional patterns, etc. |
---|
632 | |
---|
633 | In addition, {{\<}} and {{\>}} act as beginning-of-word and end-of-word |
---|
634 | marks, respectively, as in Emacs regular expressions. |
---|
635 | |
---|
636 | Also, two escapes are provided to embed SRE patterns inside PCRE |
---|
637 | strings, {{"\'<sre>"}} and {{"(*'<sre>)"}}. For example, to match a |
---|
638 | comma-delimited list of integers you could use |
---|
639 | |
---|
640 | <enscript highlight=scheme> |
---|
641 | "\\'integer(,\\'integer)*" |
---|
642 | </enscript> |
---|
643 | |
---|
644 | and to match a URL in angle brackets you could use |
---|
645 | |
---|
646 | <enscript highlight=scheme> |
---|
647 | "<('*http-url)>" |
---|
648 | </enscript> |
---|
649 | |
---|
650 | Note in the second example the enclosing {{"('*...)"}} syntax is needed |
---|
651 | because the Scheme reader would consider the closing {{">"}} as part of |
---|
652 | the SRE symbol. |
---|
653 | |
---|
654 | The following chart gives a quick reference from PCRE form to the SRE |
---|
655 | equivalent: |
---|
656 | |
---|
657 | ;; basic syntax |
---|
658 | "^" ;; bos (or eos inside (?m: ...)) |
---|
659 | "$" ;; eos (or eos inside (?m: ...)) |
---|
660 | "." ;; nonl |
---|
661 | "a?" ;; (? a) |
---|
662 | "a*" ;; (* a) |
---|
663 | "a+" ;; (+ a) |
---|
664 | "a??" ;; (?? a) |
---|
665 | "a*?" ;; (*? a) |
---|
666 | "a+?" ;; (+? a) |
---|
667 | "a{n,m}" ;; (** n m a) |
---|
668 | |
---|
669 | ;; grouping |
---|
670 | "(...)" ;; (submatch ...) |
---|
671 | "(?:...)" ;; (: ...) |
---|
672 | "(?i:...)" ;; (w/nocase ...) |
---|
673 | "(?-i:...)" ;; (w/case ...) |
---|
674 | "(?<name>...)" ;; (=> <name>...) |
---|
675 | |
---|
676 | ;; character classes |
---|
677 | "[aeiou]" ;; ("aeiou") |
---|
678 | "[^aeiou]" ;; (~ "aeiou") |
---|
679 | "[a-z]" ;; (/ "az") or (/ "a" "z") |
---|
680 | "[[:alpha:]]" ;; alpha |
---|
681 | |
---|
682 | ;; assertions |
---|
683 | "(?=...)" ;; (look-ahead ...) |
---|
684 | "(?!...)" ;; (neg-look-ahead ...) |
---|
685 | "(?<=...)" ;; (look-behind ...) |
---|
686 | "(?<!...)" ;; (neg-look-behind ...) |
---|
687 | "(?(test)pass|fail)" ;; (if test pass fail) |
---|
688 | "(*COMMIT)" ;; commit |
---|
689 | |
---|
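To make the correspondence concrete, the following sketch compiles one pattern twice, once from a PCRE string and once from what the chart suggests is the equivalent SRE, and runs both on the same inputs. The names {{pcre-irx}} and {{sre-irx}} and the test strings are illustrative only.

<enscript highlight=scheme>
(require-extension irregex)

;; "hello" matched case-insensitively, followed by one or more digits.
(define pcre-irx (irregex "(?i:hello)[0-9]+"))
(define sre-irx  (irregex '(: (w/nocase "hello") (+ (/ "09")))))

;; irregex-match succeeds only if the whole string matches.
(for-each
 (lambda (s)
   (print s " => "
          (and (irregex-match pcre-irx s) #t) " "
          (and (irregex-match sre-irx s) #t)))
 '("HeLLo42" "hello" "hello7x"))
</enscript>
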
690 | ==== Chunked String Matching |
---|
691 | |
---|
692 | It's often desirable to perform regular expression matching over |
---|
693 | sequences of characters not represented as a single string. The most |
---|
694 | obvious example is a text-buffer data structure, but you may also want |
---|
695 | to match over lists or trees of strings (i.e. ropes), over only |
---|
696 | certain ranges within a string, over an input port, etc. With |
---|
697 | existing regular expression libraries, the only way to accomplish this |
---|
698 | is by converting the abstract sequence into a freshly allocated |
---|
699 | string. This can be expensive, or even impossible if the object is a |
---|
700 | text-buffer opened onto a 500MB file. |
---|
701 | |
---|
702 | IrRegex provides a chunked string API specifically for this purpose. |
---|
703 | You define a chunking API with {{make-irregex-chunker}}: |
---|
704 | |
---|
705 | ===== make-irregex-chunker |
---|
706 | |
---|
707 | <procedure>(make-irregex-chunker <get-next> <get-string> [<get-start> <get-end> <get-substring> <get-subchunk>])</procedure> |
---|
708 | |
---|
709 | where |
---|
710 | |
---|
711 | {{(<get-next> chunk) => }} returns the next chunk, or {{#f}} if there are no more chunks |
---|
712 | |
---|
713 | {{(<get-string> chunk) => }} a string source for the chunk |
---|
714 | |
---|
715 | {{(<get-start> chunk) => }} the start index of the result of {{<get-string>}} (defaults to always 0) |
---|
716 | |
---|
717 | {{(<get-end> chunk) => }} the end (exclusive) of the string (defaults to {{string-length}} of the source string) |
---|
718 | |
---|
719 | {{(<get-substring> cnk1 i cnk2 j) => }} a substring for the range between the chunk {{cnk1}} starting at index {{i}} and ending at {{cnk2}} at index {{j}} |
---|
720 | |
---|
721 | {{(<get-subchunk> cnk1 i cnk2 j) => }} as above but returns a new chunked data type instead of a string (optional) |
---|
722 | |
---|
723 | There are two important constraints on the {{<get-next>}} procedure. |
---|
724 | It must return an {{eq?}} identical object when called multiple times |
---|
725 | on the same chunk, and it must not return a chunk with an empty string |
---|
726 | (start == end). This second constraint is for performance reasons - |
---|
727 | we push the work of possibly filtering empty chunks to the chunker |
---|
728 | since there are many chunk types for which empty strings aren't |
---|
729 | possible, and this work is thus not needed. Note that the initial |
---|
730 | chunk passed to the matching procedure is allowed to be empty. |
---|
731 | |
---|
732 | {{<get-substring>}} is provided for possible performance improvements |
---|
733 | - without it a default is used. {{<get-subchunk>}} is optional - |
---|
734 | without it you may not use {{irregex-match-subchunk}} described above. |
---|
735 | |
---|
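The two constraints on {{<get-next>}} can be handled inside the chunker itself. The sketch below assumes that a chunk is a pair in a plain list of strings and that the list may contain empty strings; {{next-nonempty}} and {{sparse-chunker}} are hypothetical names, not part of the library.

<enscript highlight=scheme>
(require-extension irregex)

;; Return the next pair whose string is non-empty, or #f at the end.
;; The result is always a tail of the original list, so repeated calls
;; on the same chunk return an eq? identical object.
(define (next-nonempty chunk)
  (let loop ((rest (cdr chunk)))
    (cond ((null? rest) #f)
          ((string=? "" (car rest)) (loop (cdr rest)))
          (else rest))))

;; Only <get-next> and <get-string> are given; the rest use defaults.
(define sparse-chunker
  (make-irregex-chunker next-nonempty car))
</enscript>
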
736 | You can then match chunks of these types with the following |
---|
737 | procedures: |
---|
738 | |
---|
739 | ===== irregex-search/chunked |
---|
740 | ===== irregex-match/chunked |
---|
741 | |
---|
742 | <procedure>(irregex-search/chunked <irx> <chunker> <chunk> [<start>])</procedure><br> |
---|
743 | <procedure>(irregex-match/chunked <irx> <chunker> <chunk> [<start>])</procedure> |
---|
744 | |
---|
745 | These return normal match-data objects. |
---|
746 | |
---|
747 | Example: |
---|
748 | |
---|
749 | To match against a simple, flat list of strings use: |
---|
750 | |
---|
751 | <enscript highlight=scheme> |
---|
752 | (define (rope->string rope1 start rope2 end) |
---|
753 |   (if (eq? rope1 rope2) |
---|
754 |       (substring (car rope1) start end) |
---|
755 |       (let loop ((rope (cdr rope1)) |
---|
756 |                  (res (list (substring (car rope1) start)))) |
---|
757 |         (if (eq? rope rope2) |
---|
758 |             (string-concatenate-reverse ; from SRFI-13 |
---|
759 |              (cons (substring (car rope) 0 end) res)) |
---|
760 |             (loop (cdr rope) (cons (car rope) res)))))) |
---|
761 | |
---|
762 | (define rope-chunker |
---|
763 |   (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) |
---|
764 |                         car |
---|
765 |                         (lambda (x) 0) |
---|
766 |                         (lambda (x) (string-length (car x))) |
---|
767 |                         rope->string)) |
---|
768 | |
---|
769 | (irregex-search/chunked <pat> rope-chunker <list-of-strings>) |
---|
770 | </enscript> |
---|
771 | |
---|
772 | Here we are just using the default start, end and substring behaviors, |
---|
773 | so the above chunker could simply be defined as: |
---|
774 | |
---|
775 | <enscript highlight=scheme> |
---|
776 | (define rope-chunker |
---|
777 | (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car)) |
---|
778 | </enscript> |
---|
779 | |
---|
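The optional {{<get-start>}} and {{<get-end>}} arguments are what make it possible to match over only a range of a string. A minimal sketch, assuming a chunk is represented as a vector {{#(string start end)}} and that there is never a following chunk; this representation and the name {{range-chunker}} are choices made for the example only.

<enscript highlight=scheme>
(require-extension irregex)

;; A chunk is #(string start end); the sequence has a single chunk.
(define range-chunker
  (make-irregex-chunker
   (lambda (chunk) #f)                   ; <get-next>: no more chunks
   (lambda (chunk) (vector-ref chunk 0)) ; <get-string>
   (lambda (chunk) (vector-ref chunk 1)) ; <get-start>
   (lambda (chunk) (vector-ref chunk 2)))) ; <get-end> (exclusive)

;; Search for digits only within indices 4..8 of the string.
(define m
  (irregex-search/chunked "[0-9]+" range-chunker (vector "ab12cd34" 4 8)))

(if m (print (irregex-match-substring m)))
</enscript>
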
780 | ===== irregex-fold/chunked |
---|
781 | |
---|
782 | <procedure>(irregex-fold/chunked <irx> <kons> <knil> <chunker> <chunk> [<finish> [<start-index>]])</procedure> |
---|
783 | |
---|
784 | Chunked version of {{irregex-fold}}. |
---|
785 | |
---|
786 | ==== Utilities |
---|
787 | |
---|
788 | The following procedures are also available. |
---|
789 | |
---|
790 | ===== irregex-quote |
---|
791 | |
---|
792 | <procedure>(irregex-quote <str>)</procedure> |
---|
793 | |
---|
794 | Returns a new string with any special regular expression characters |
---|
795 | escaped, to match the original string literally in POSIX regular |
---|
796 | expressions. |
---|
797 | |
---|
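The effect is easiest to see with a string that contains an operator character. In this sketch the variable {{literal}} and its value are arbitrary.

<enscript highlight=scheme>
(require-extension irregex)

(define literal "1+1=2")

;; Used directly as a pattern, "+" is a repetition operator, so the
;; pattern does not match the literal text itself.
(print (and (irregex-match literal literal) #t))

;; The quoted form matches the original string literally.
(print (and (irregex-match (irregex-quote literal) literal) #t))
</enscript>
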
798 | ===== irregex-opt |
---|
799 | |
---|
800 | <procedure>(irregex-opt <list-of-strings>)</procedure> |
---|
801 | |
---|
802 | Returns an optimized SRE matching any of the literal strings |
---|
803 | in the list, like Emacs' {{regexp-opt}}. Note this optimization |
---|
804 | doesn't help when irregex is able to build a DFA. |
---|
805 | |
---|
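A short usage sketch follows. The word list and the names {{word-sre}} and {{word-irx}} are invented for the example, and the exact shape of the returned SRE is not relied upon.

<enscript highlight=scheme>
(require-extension irregex)

;; Build one SRE that matches any of the listed words, then compile it.
(define word-sre (irregex-opt '("cat" "car" "dog")))
(define word-irx (irregex word-sre))

(for-each
 (lambda (s)
   (print s " => " (and (irregex-match word-irx s) #t)))
 '("car" "dog" "cow"))
</enscript>
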
806 | ===== sre->string |
---|
807 | |
---|
808 | <procedure>(sre->string <sre>)</procedure> |
---|
809 | |
---|
810 | Convert an SRE to a POSIX-style regular expression string, if |
---|
811 | possible. |
---|
812 | |
---|
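One way to check a conversion is to compile the resulting string again and compare its behaviour with the original SRE. The sketch below does that for a simple pattern; the names {{sre}} and {{str}} are illustrative, and the exact text of the generated string is not assumed.

<enscript highlight=scheme>
(require-extension irregex)

(define sre '(: "ab" (+ "c")))

;; Convert to a POSIX-style string.
(define str (sre->string sre))
(print str)

;; Both forms should accept and reject the same inputs.
(for-each
 (lambda (s)
   (print s " => "
          (and (irregex-match sre s) #t) " "
          (and (irregex-match str s) #t)))
 '("abcc" "ab"))
</enscript>
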
813 | === License |
---|
814 | |
---|
815 | Copyright (c) 2005-2010 Alex Shinn |
---|
816 | All rights reserved. |
---|
817 | |
---|
818 | Redistribution and use in source and binary forms, with or without |
---|
819 | modification, are permitted provided that the following conditions |
---|
820 | are met: |
---|
821 | |
---|
822 | 1. Redistributions of source code must retain the above copyright |
---|
823 | notice, this list of conditions and the following disclaimer. |
---|
824 | 2. Redistributions in binary form must reproduce the above copyright |
---|
825 | notice, this list of conditions and the following disclaimer in the |
---|
826 | documentation and/or other materials provided with the distribution. |
---|
827 | 3. The name of the author may not be used to endorse or promote products |
---|
828 | derived from this software without specific prior written permission. |
---|
829 | |
---|
830 | THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR |
---|
831 | IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES |
---|
832 | OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. |
---|
833 | IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, |
---|
834 | INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT |
---|
835 | NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, |
---|
836 | DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY |
---|
837 | THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT |
---|
838 | (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF |
---|
839 | THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. |
---|
840 | |
---|
841 | === References |
---|
842 | |
---|
843 | * R. Kelsey, W. Clinger, J. Rees (eds.): ''[[http://www.schemers.org/Documents/Standards/R5RS/|Revised^5 Report on the Algorithmic Language Scheme]]'' |
---|
844 | |
---|
845 | * Russ Cox: ''[[http://swtch.com/~rsc/regexp/|Implementing Regular Expressions]]'' |
---|
846 | |
---|
847 | * Russ Cox: ''[[http://compilers.iecc.com/comparch/article/07-10-026|Henry Spencer's Tcl Regex Library]]'' |
---|
848 | |
---|
849 | * Olin Shivers: ''[[http://www.scsh.net/docu/post/sre.html|Proposed SRE regular-expression notation]]'' |
---|
850 | |
---|
851 | * Olin Shivers: ''[[http://www.scsh.net/docu/html/man-Z-H-7.html|Pattern-matching strings with regular expressions]]'' |
---|
852 | |
---|
853 | * Shiro Kawai: ''[[http://practical-scheme.net/gauche/man/gauche-refe_49.html|Gauche Scheme - Regular Expressions]]'' |
---|
854 | |
---|
855 | * Damian Conway: ''[[http://www.perl.com/pub/a/2002/08/22/exegesis5.html|Perl6 Exegesis 5 - Regular Expressions]]'' |
---|
856 | |
---|
857 | * Philip Hazel: ''[[http://www.pcre.org/|PCRE - Perl Compatible Regular Expressions]]'' |
---|