1 | [[tags: manual]] |
---|
2 | [[toc:]] |
---|
3 | |
---|
4 | == Unit regex |
---|
5 | |
---|
6 | This library unit provides support for regular expressions. The regular |
---|
7 | expression package used is {{irregex}} |
---|
8 | written by Alex Shinn. Irregex supports most Perl-extensions and is |
---|
9 | written completely in Scheme. |
---|
10 | |
---|
11 | This library unit exposes two APIs: the standard Chicken API described below, and the |
---|
12 | original irregex API. You may use either API or both: |
---|
13 | |
---|
14 | (require-library regex) ; required for either API, or both |
---|
15 | (import regex) ; import the Chicken regex API |
---|
16 | (import irregex) ; import the original irregex API |
---|
17 | |
---|
18 | Regular expressions may be either POSIX-style strings (with most PCRE |
---|
19 | extensions) or an SCSH-style SRE. There is no {{(rx ...)}} syntax - |
---|
20 | just use normal Scheme lists, with quasiquote if you like. |
---|
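As a minimal sketch of the quasiquote approach, the SRE below is assembled at run time from a Scheme variable. The names {{word}} and {{word-sre}} are illustrative only and are not part of the library.

```scheme
(require-library regex)
(import regex)

;; `word` is a hypothetical value that is only known at run time.
(define word "needle")

;; Quasiquote builds the SRE list; unquote inserts the string.
(define word-sre `(: bow ,word eow))

;; Matches only when "needle" stands as a word of its own.
(define hit  (string-search word-sre "hay needle hay"))  ; => ("needle")
(define miss (string-search word-sre "hayneedlehay"))    ; => #f
```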
21 | |
---|
22 | === grep |
---|
23 | |
---|
24 | <procedure>(grep REGEX LIST [ACCESSOR])</procedure> |
---|
25 | |
---|
26 | Returns all items of {{LIST}} that match the regular expression |
---|
27 | {{REGEX}}. This procedure could be defined as follows: |
---|
28 | |
---|
29 | <enscript highlight=scheme> |
---|
30 | (define (grep regex lst) |
---|
31 | (filter (lambda (x) (string-search regex x)) lst) ) |
---|
32 | </enscript> |
---|
33 | |
---|
34 | {{ACCESSOR}} is an optional accessor-procedure applied to each |
---|
35 | element before doing the match. It should take a single argument |
---|
36 | and return a string that will then be used in the regular expression |
---|
37 | matching. {{ACCESSOR}} defaults to the identity function. |
---|
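A short sketch of {{grep}} with an {{ACCESSOR}}, assuming the list holds pairs whose {{car}} is the string to test. The {{files}} data is made up for illustration.

```scheme
(require-library regex)
(import regex)

;; Hypothetical association list of (name . size) pairs.
(define files '(("main.scm" . 120) ("notes.txt" . 8) ("util.scm" . 64)))

;; `car` extracts the string to match; the whole pair is returned.
(define scheme-files (grep "\\.scm$" files car))
;; => (("main.scm" . 120) ("util.scm" . 64))
```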
38 | |
---|
39 | |
---|
40 | === glob->regexp |
---|
41 | |
---|
42 | <procedure>(glob->regexp PATTERN [SRE?])</procedure> |
---|
43 | |
---|
44 | Converts the file-pattern {{PATTERN}} into a regular expression. |
---|
45 | |
---|
46 | <enscript highlight=scheme> |
---|
47 | (glob->regexp "foo.*") |
---|
48 | => a regular expression object equivalent to "foo\..*" |
---|
49 | </enscript> |
---|
50 | |
---|
51 | {{PATTERN}} should follow "glob" syntax. Allowed wildcards are |
---|
52 | |
---|
53 | * |
---|
54 | [C...] |
---|
55 | [C1-C2] |
---|
56 | [-C...] |
---|
57 | ? |
---|
58 | |
---|
59 | {{glob->regexp}} returns a regular expression object if the optional |
---|
60 | argument {{SRE?}} is false or not given, otherwise the SRE of the |
---|
61 | computed regular expression is returned. |
---|
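The following sketch shows one way the result of {{glob->regexp}} might be used: it is passed to {{string-match}} so that the whole file name has to fit the glob. The file names are hypothetical.

```scheme
(require-library regex)
(import regex)

;; Compile the glob once, then test complete file names against it.
(define scm-glob (glob->regexp "*.scm"))

;; `and ... #t` turns the match list into a plain boolean.
(define ok  (and (string-match scm-glob "regex.scm") #t))  ; => #t
(define bad (and (string-match scm-glob "regex.txt") #t))  ; => #f
```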
62 | |
---|
63 | |
---|
64 | === regexp |
---|
65 | |
---|
66 | <procedure>(regexp STRING [IGNORECASE [IGNORESPACE [UTF8]]])</procedure> |
---|
67 | |
---|
68 | Returns a precompiled regular expression object for {{STRING}}. |
---|
69 | The optional arguments {{IGNORECASE}}, {{IGNORESPACE}} and {{UTF8}} |
---|
70 | specify whether the regular expression should be matched with case- or whitespace-differences |
---|
71 | ignored, or whether the string should be treated as containing UTF-8 encoded |
---|
72 | characters, respectively. |
---|
73 | |
---|
74 | Note that code that uses regular expressions heavily should always |
---|
75 | use them in precompiled form, which is likely to be much faster than |
---|
76 | passing strings to any of the regular-expression routines described |
---|
77 | below. |
---|
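A minimal sketch of the precompiled style described above: the expression is compiled once with {{regexp}} and the resulting object is reused for every search. The variable names and sample strings are illustrative.

```scheme
(require-library regex)
(import regex)

;; Compile once, outside the loop.
(define number-rx (regexp "[0-9]+"))

(define lines '("width 80" "no digits here" "height 24"))

;; The same compiled object is reused for each element.
(define numbers (map (lambda (l) (string-search number-rx l)) lines))
;; => (("80") #f ("24"))

;; Passing #t as IGNORECASE gives a case-insensitive expression.
(define hello-ci (regexp "hello" #t))
(define found (string-search hello-ci "Say HELLO"))  ; => ("HELLO")
```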
78 | |
---|
79 | |
---|
80 | === regexp? |
---|
81 | |
---|
82 | <procedure>(regexp? X)</procedure> |
---|
83 | |
---|
84 | Returns {{#t}} if {{X}} is a precompiled regular expression, |
---|
85 | or {{#f}} otherwise. |
---|
86 | |
---|
87 | |
---|
88 | === string-match |
---|
89 | === string-match-positions |
---|
90 | |
---|
91 | <procedure>(string-match REGEXP STRING)</procedure><br> |
---|
92 | <procedure>(string-match-positions REGEXP STRING)</procedure> |
---|
93 | |
---|
94 | Matches the regular expression in {{REGEXP}} (a string or a precompiled |
---|
95 | regular expression) with |
---|
96 | {{STRING}} and returns either {{#f}} if the match failed, |
---|
97 | or a list of matching groups, where the first element is the complete |
---|
98 | match. For each matching group the |
---|
99 | result-list contains either: {{#f}} for a non-matching but optional |
---|
100 | group; a list of start- and end-position of the match in {{STRING}} |
---|
101 | (in the case of {{string-match-positions}}); or the matching |
---|
102 | substring (in the case of {{string-match}}). Note that the exact string |
---|
103 | is matched. For searching a pattern inside a string, see below. |
---|
104 | Note also that {{string-match}} is implemented by calling |
---|
105 | {{string-search}} with the regular expression wrapped in {{^ ... $}}. |
---|
106 | If invoked with a precompiled regular expression argument (by using |
---|
107 | {{regexp}}), {{string-match}} is identical to {{string-search}}. |
---|
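The result shapes described above can be sketched as follows. The pattern and input strings are arbitrary examples.

```scheme
(require-library regex)
(import regex)

;; Whole match first, then one entry per parenthesized group.
(define whole (string-match "([a-z]+)-([0-9]+)" "abc-123"))
;; => ("abc-123" "abc" "123")

;; Same groups, reported as (start end) index lists.
(define offsets (string-match-positions "([a-z]+)-([0-9]+)" "abc-123"))
;; => ((0 7) (0 3) (4 7))

;; The entire string must match, so a partial match fails.
(define partial (string-match "[0-9]+" "abc 123"))  ; => #f

;; An optional group that did not take part yields #f in its slot.
(define skipped (string-match "(a)?b" "b"))  ; => ("b" #f)
```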
108 | |
---|
109 | |
---|
110 | === string-search |
---|
111 | === string-search-positions |
---|
112 | |
---|
113 | <procedure>(string-search REGEXP STRING [START [RANGE]])</procedure><br> |
---|
114 | <procedure>(string-search-positions REGEXP STRING [START [RANGE]])</procedure> |
---|
115 | |
---|
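116 | Searches for the first match of the regular expression in |
---|
117 | {{REGEXP}} within {{STRING}}, starting at index {{START}} (default 0). The search can be limited to |
---|
118 | {{RANGE}} characters. The result has the same form as that of {{string-match}} or {{string-match-positions}}, respectively. |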
---|
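A brief sketch of searching inside a longer string, assuming {{START}} is the index at which the search begins. The sample text is arbitrary.

```scheme
(require-library regex)
(import regex)

(define text "abc 123 def 456")

;; First match anywhere in the string.
(define first-hit (string-search "[0-9]+" text))            ; => ("123")
(define first-pos (string-search-positions "[0-9]+" text))  ; => ((4 7))

;; Begin at index 7, just past the first number.
(define later-hit (string-search "[0-9]+" text 7))          ; => ("456")
```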
119 | |
---|
120 | |
---|
121 | === string-split-fields |
---|
122 | |
---|
123 | <procedure>(string-split-fields REGEXP STRING [MODE [START]])</procedure> |
---|
124 | |
---|
125 | Splits {{STRING}} into a list of fields according to {{MODE}}, |
---|
126 | where {{MODE}} can be the keyword {{#:infix}} ({{REGEXP}} |
---|
127 | matches field separator), the keyword {{#:suffix}} ({{REGEXP}} |
---|
128 | matches field terminator) or {{#t}} ({{REGEXP}} matches field), |
---|
129 | which is the default. |
---|
130 | |
---|
131 | <enscript highlight=scheme> |
---|
132 | (define s "this is a string 1, 2, 3,") |
---|
133 | |
---|
134 | (string-split-fields "[^ ]+" s) |
---|
135 | |
---|
136 | => ("this" "is" "a" "string" "1," "2," "3,") |
---|
137 | |
---|
138 | (string-split-fields " " s #:infix) |
---|
139 | |
---|
140 | => ("this" "is" "a" "string" "1," "2," "3,") |
---|
141 | |
---|
142 | (string-split-fields "," s #:suffix) |
---|
143 | |
---|
144 | => ("this is a string 1" " 2" " 3") |
---|
145 | </enscript> |
---|
146 | |
---|
147 | |
---|
148 | === string-substitute |
---|
149 | |
---|
150 | <procedure>(string-substitute REGEXP SUBST STRING [MODE])</procedure> |
---|
151 | |
---|
152 | Searches substrings in {{STRING}} that match {{REGEXP}} |
---|
153 | and substitutes them with the string {{SUBST}}. The substitution |
---|
154 | can contain references to subexpressions in |
---|
155 | {{REGEXP}} with the {{\NUM}} notation, where {{NUM}} |
---|
156 | refers to the NUMth parenthesized expression. The optional argument |
---|
157 | {{MODE}} defaults to 1 and specifies the number of the match to |
---|
158 | be substituted. Any non-numeric index specifies that all matches are to |
---|
159 | be substituted. |
---|
160 | |
---|
161 | <enscript highlight=scheme> |
---|
162 | (string-substitute "([0-9]+) (eggs|chicks)" "\\2 (\\1)" "99 eggs or 99 chicks" 2) |
---|
163 | => "99 eggs or chicks (99)" |
---|
164 | </enscript> |
---|
165 | |
---|
166 | Note that a regular expression that matches an empty string will |
---|
167 | signal an error. |
---|
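The effect of {{MODE}} can be sketched as follows, using an arbitrary input string: the default replaces the first match, a number selects one particular match, and a non-numeric value such as {{#t}} replaces all of them.

```scheme
(require-library regex)
(import regex)

(define s "1 and 22 and 333")

;; Default MODE is 1: only the first match is replaced.
(define first-only (string-substitute "[0-9]+" "N" s))     ; => "N and 22 and 333"

;; A numeric MODE selects that match.
(define third-only (string-substitute "[0-9]+" "N" s 3))   ; => "1 and 22 and N"

;; Any non-numeric MODE replaces every match.
(define every      (string-substitute "[0-9]+" "N" s #t))  ; => "N and N and N"
```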
168 | |
---|
169 | |
---|
170 | === string-substitute* |
---|
171 | |
---|
172 | <procedure>(string-substitute* STRING SMAP [MODE])</procedure> |
---|
173 | |
---|
174 | Substitutes elements of {{STRING}} with {{string-substitute}} according to {{SMAP}}. |
---|
175 | {{SMAP}} should be an association-list where each element of the list |
---|
176 | is a pair of the form {{(MATCH . REPLACEMENT)}}. Every occurrence of |
---|
177 | the regular expression {{MATCH}} in {{STRING}} will be replaced by the string |
---|
178 | {{REPLACEMENT}}. |
---|
179 | |
---|
180 | <enscript highlight=scheme> |
---|
181 | (string-substitute* "<h1>Hello, world!</h1>" '(("<[/A-Za-z0-9]+>" . ""))) |
---|
182 | |
---|
183 | => "Hello, world!" |
---|
184 | </enscript> |
---|
185 | |
---|
186 | |
---|
187 | === regexp-escape |
---|
188 | |
---|
189 | <procedure>(regexp-escape STRING)</procedure> |
---|
190 | |
---|
191 | Escapes all special characters in {{STRING}} with {{\}}, so that the string can be embedded |
---|
192 | into a regular expression. |
---|
193 | |
---|
194 | <enscript highlight=scheme> |
---|
195 | (regexp-escape "^[0-9]+:.*$") |
---|
196 | => "\\^\\[0-9\\]\\+:\\.\\*\\$" |
---|
197 | </enscript> |
---|
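One typical use of {{regexp-escape}}, sketched here with a made-up {{needle}} string, is searching for text that happens to contain regular expression operators.

```scheme
(require-library regex)
(import regex)

;; Text to be found literally; "+" is a regex operator.
(define needle "1+1")

;; Unescaped, the pattern means "one or more 1s followed by 1".
(define unescaped (string-search needle "what is 1+1?"))                  ; => #f

;; Escaped, every character is matched literally.
(define escaped   (string-search (regexp-escape needle) "what is 1+1?"))  ; => ("1+1")
```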
198 | |
---|
199 | === Extended SRE Syntax |
---|
200 | |
---|
201 | The following table summarizes the SRE syntax, with detailed explanations following. |
---|
202 | |
---|
203 | ;; basic patterns |
---|
204 | <string> ; literal string |
---|
205 | (seq <sre> ...) ; sequence |
---|
206 | (: <sre> ...) |
---|
207 | (or <sre> ...) ; alternation |
---|
208 | |
---|
209 | ;; optional/multiple patterns |
---|
210 | (? <sre> ...) ; 0 or 1 matches |
---|
211 | (* <sre> ...) ; 0 or more matches |
---|
212 | (+ <sre> ...) ; 1 or more matches |
---|
213 | (= <n> <sre> ...) ; exactly <n> matches |
---|
214 | (>= <n> <sre> ...) ; <n> or more matches |
---|
215 | (** <from> <to> <sre> ...) ; <from> to <to> matches |
---|
216 | (?? <sre> ...) ; non-greedy pattern: (0 or 1) |
---|
217 | (*? <sre> ...) ; non-greedy kleene star |
---|
218 | (**? <from> <to> <sre> ...) ; non-greedy range |
---|
219 | |
---|
220 | ;; submatch patterns |
---|
221 | (submatch <sre> ...) ; numbered submatch |
---|
222 | (submatch-named <name> <sre> ...) ; named submatch |
---|
223 | (=> <name> <sre> ...) |
---|
224 | (backref <n-or-name>) ; match a previous submatch |
---|
225 | |
---|
226 | ;; toggling case-sensitivity |
---|
227 | (w/case <sre> ...) ; enclosed <sre>s are case-sensitive |
---|
228 | (w/nocase <sre> ...) ; enclosed <sre>s are case-insensitive |
---|
229 | |
---|
230 | ;; character sets |
---|
231 | <char> ; singleton char set |
---|
232 | (<string>) ; set of chars |
---|
233 | (or <cset-sre> ...) ; set union |
---|
234 | (~ <cset-sre> ...) ; set complement (i.e. [^...]) |
---|
235 | (- <cset-sre> ...) ; set difference |
---|
236 | (& <cset-sre> ...) ; set intersection |
---|
237 | (/ <range-spec> ...) ; pairs of chars as ranges |
---|
238 | |
---|
239 | ;; named character sets |
---|
240 | any |
---|
241 | nonl |
---|
242 | ascii |
---|
243 | lower-case lower |
---|
244 | upper-case upper |
---|
245 | alphabetic alpha |
---|
246 | numeric num |
---|
247 | alphanumeric alphanum alnum |
---|
248 | punctuation punct |
---|
249 | graphic graph |
---|
250 | whitespace white space |
---|
251 | printing print |
---|
252 | control cntrl |
---|
253 | hex-digit xdigit |
---|
254 | |
---|
255 | ;; assertions and conditionals |
---|
256 | bos eos ; beginning/end of string |
---|
257 | bol eol ; beginning/end of line |
---|
258 | bow eow ; beginning/end of word |
---|
259 | nwb ; non-word-boundary |
---|
260 | (look-ahead <sre> ...) ; zero-width look-ahead assertion |
---|
261 | (look-behind <sre> ...) ; zero-width look-behind assertion |
---|
262 | (neg-look-ahead <sre> ...) ; zero-width negative look-ahead assertion |
---|
263 | (neg-look-behind <sre> ...) ; zero-width negative look-behind assertion |
---|
264 | (atomic <sre> ...) ; for (?>...) independent patterns |
---|
265 | (if <test> <pass> [<fail>]) ; conditional patterns |
---|
266 | commit ; don't backtrack beyond this (i.e. cut) |
---|
267 | |
---|
268 | ;; backwards compatibility |
---|
269 | (posix-string <string>) ; embed a POSIX string literal |
---|
270 | |
---|
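As a small illustration of the {{submatch}} form from the table above, the sketch below captures two groups of digits. The name {{phone-sre}} and the input string are invented for the example.

```scheme
(require-library regex)
(import regex)

;; Two numbered submatches separated by a literal dash.
(define phone-sre '(: (submatch (+ numeric)) "-" (submatch (+ numeric))))

;; The result lists the whole match first, then each submatch.
(define parts (string-search phone-sre "tel 555-1234"))
;; => ("555-1234" "555" "1234")
```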
271 | ==== Basic SRE Patterns |
---|
272 | |
---|
273 | The simplest SRE is a literal string, which matches that string exactly. |
---|
274 | |
---|
275 | (string-search "needle" "hayneedlehay") => <match> |
---|
276 | |
---|
277 | By default the match is case-sensitive, though you can control this either with the compiler flags or local overrides: |
---|
278 | |
---|
279 | (string-search "needle" "haynEEdlehay") => #f |
---|
280 | |
---|
281 | (string-search (irregex "needle" 'i) "haynEEdlehay") => <match> |
---|
282 | |
---|
283 | (string-search '(w/nocase "needle") "haynEEdlehay") => <match> |
---|
284 | |
---|
285 | You can use {{w/case}} to switch back to case-sensitivity inside a {{w/nocase}}: |
---|
286 | |
---|
287 | (string-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => <match> |
---|
288 | |
---|
289 | (string-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f |
---|
290 | |
---|
291 | Of course, literal strings by themselves aren't very interesting |
---|
292 | regular expressions, so we want to be able to compose them. The most |
---|
293 | basic way to do this is with the {{seq}} operator (or its abbreviation {{:}}), |
---|
294 | which matches one or more patterns consecutively: |
---|
295 | |
---|
296 | (string-search '(: "one" space "two" space "three") "one two three") => <match> |
---|
297 | |
---|
298 | As you may have noticed above, the {{w/case}} and {{w/nocase}} operators |
---|
299 | allowed multiple SREs in a sequence - other operators that take any |
---|
300 | number of arguments (e.g. the repetition operators below) allow such |
---|
301 | implicit sequences. |
---|
302 | |
---|
303 | To match any one of a set of patterns use the {{or}} alternation operator: |
---|
304 | |
---|
305 | (string-search '(or "eeney" "meeney" "miney") "meeney") => <match> |
---|
306 | |
---|
307 | (string-search '(or "eeney" "meeney" "miney") "moe") => #f |
---|
308 | |
---|
309 | ==== SRE Repetition Patterns |
---|
310 | |
---|
311 | There are also several ways to control the number of times a pattern |
---|
312 | is matched. The simplest of these is {{?}} which just optionally matches |
---|
313 | the pattern: |
---|
314 | |
---|
315 | (string-search '(: "match" (? "es") "!") "matches!") => <match> |
---|
316 | |
---|
317 | (string-search '(: "match" (? "es") "!") "match!") => <match> |
---|
318 | |
---|
319 | (string-search '(: "match" (? "es") "!") "matche!") => #f |
---|
320 | |
---|
321 | To optionally match any number of times, use {{*}}, the Kleene star: |
---|
322 | |
---|
323 | (string-search '(: "<" (* (~ #\>)) ">") "<html>") => <match> |
---|
324 | |
---|
325 | (string-search '(: "<" (* (~ #\>)) ">") "<>") => <match> |
---|
326 | |
---|
327 | (string-search '(: "<" (* (~ #\>)) ">") "<html") => #f |
---|
328 | |
---|
329 | Often you want to match any number of times, but at least one time is required, and for that you use {{+}}: |
---|
330 | |
---|
331 | (string-search '(: "<" (+ (~ #\>)) ">") "<html>") => <match> |
---|
332 | |
---|
333 | (string-search '(: "<" (+ (~ #\>)) ">") "<a>") => <match> |
---|
334 | |
---|
335 | (string-search '(: "<" (+ (~ #\>)) ">") "<>") => #f |
---|
336 | |
---|
337 | More generally, to match at least a given number of times, use {{>=}}: |
---|
338 | |
---|
339 | (string-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => <match> |
---|
340 | |
---|
341 | (string-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => <match> |
---|
342 | |
---|
343 | (string-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f |
---|
344 | |
---|
345 | To match a specific number of times exactly, use {{=}}: |
---|
346 | |
---|
347 | (string-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => <match> |
---|
348 | |
---|
349 | (string-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f |
---|
350 | |
---|
351 | And finally, the most general form is {{**}} which specifies a range |
---|
352 | of times to match. All of the earlier forms are special cases of this. |
---|
353 | |
---|
354 | (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => <match> |
---|
355 | |
---|
356 | (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f |
---|
357 | |
---|
358 | There are also so-called "non-greedy" variants of these repetition |
---|
359 | operators, by convention suffixed with an additional {{?}}. Since the |
---|
360 | normal repetition patterns can match any of the allotted repetition |
---|
361 | range, these operators will match a string if and only if the normal |
---|
362 | versions matched. However, once you take into account where each |
---|
363 | submatch begins and ends (which always matters when |
---|
364 | using {{string-search}}, since the endpoints of the match itself are reported), |
---|
365 | the use of a non-greedy repetition can change the result. |
---|
366 | |
---|
367 | So, whereas {{?}} can be thought to mean "match or don't match," {{??}} means |
---|
368 | "don't match or match." {{*}} typically consumes as much as possible, but |
---|
369 | {{*?}} tries first to match zero times, and only consumes one at a time if |
---|
370 | that fails. If you have a greedy operator followed by a non-greedy |
---|
371 | operator in the same pattern, they can produce surprising results as |
---|
372 | they compete to make the match longer or shorter. If this seems |
---|
373 | confusing, that's because it is. Non-greedy repetitions are defined |
---|
374 | only in terms of the specific backtracking algorithm used to implement |
---|
375 | them, which for compatibility purposes always means the Perl |
---|
376 | algorithm. Thus, when using these patterns you force IrRegex to use a |
---|
377 | backtracking engine, and can't rely on efficient execution. |
---|
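The difference is easiest to see on a string where both variants succeed but stop at different places. In this sketch the {{markup}} string is an arbitrary example.

```scheme
(require-library regex)
(import regex)

(define markup "<a><b>")

;; Greedy: (* any) consumes as much as it can before the final ">".
(define greedy (string-search '(: "<" (* any) ">") markup))   ; => ("<a><b>")

;; Non-greedy: (*? any) stops at the first ">" that works.
(define lazy   (string-search '(: "<" (*? any) ">") markup))  ; => ("<a>")
```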
378 | |
---|
379 | ==== SRE Character Sets |
---|
380 | |
---|
381 | Perhaps more common than matching specific strings is matching any of |
---|
382 | a set of characters. You can use the {{or}} alternation pattern on a list |
---|
383 | of single-character strings to simulate a character set, but this is |
---|
384 | too clumsy for everyday use so SRE syntax allows a number of |
---|
385 | shortcuts. |
---|
386 | |
---|
387 | A single character matches that character literally, a trivial |
---|
388 | character class. More conveniently, a list holding a single element |
---|
389 | which is a string refers to the character set composed of every |
---|
390 | character in the string. |
---|
391 | |
---|
392 | (string-match '(* #\-) "---") => <match> |
---|
393 | |
---|
394 | (string-match '(* #\-) "-_-") => #f |
---|
395 | |
---|
396 | (string-match '(* ("aeiou")) "oui") => <match> |
---|
397 | |
---|
398 | (string-match '(* ("aeiou")) "ouais") => #f |
---|
399 | |
---|
400 | Ranges are introduced with the {{/}} operator. Any strings or characters |
---|
401 | in the {{/}} are flattened and then taken in pairs to represent the start |
---|
402 | and end points, inclusive, of character ranges. |
---|
403 | |
---|
404 | (string-match '(* (/ "AZ09")) "R2D2") => <match> |
---|
405 | |
---|
406 | (string-match '(* (/ "AZ09")) "C-3PO") => #f |
---|
407 | |
---|
408 | In addition, a number of set algebra operations are provided. {{or}}, of |
---|
409 | course, has the same meaning, but when all the options are character |
---|
410 | sets it can be thought of as the set union operator. This is further |
---|
411 | extended by the {{&}} set intersection, {{-}} set difference, and {{~}} set |
---|
412 | complement operators. |
---|
413 | |
---|
414 | (string-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => <match> |
---|
415 | |
---|
416 | (string-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f |
---|
417 | |
---|
418 | (string-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => <match> |
---|
419 | |
---|
420 | (string-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f |
---|
421 | |
---|
422 | ==== SRE Assertion Patterns |
---|
423 | |
---|
424 | There are a number of times it can be useful to assert something about |
---|
425 | the area around a pattern without explicitly making it part of the |
---|
426 | pattern. The most common cases are specifically anchoring some pattern |
---|
427 | to the beginning or end of a word or line or even the whole |
---|
428 | string. For example, to match on the end of a word: |
---|
429 | |
---|
430 | (string-search '(: "foo" eow) "foo") => <match> |
---|
431 | |
---|
432 | (string-search '(: "foo" eow) "foo!") => <match> |
---|
433 | |
---|
434 | (string-search '(: "foo" eow) "foof") => #f |
---|
435 | |
---|
436 | {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} work similarly. {{nwb}} asserts that you |
---|
437 | are not at a word boundary - if substituted for {{eow}} in the above examples |
---|
438 | it would reverse all the results. |
---|
439 | |
---|
440 | There is no {{wb}}, since you tend to know from context whether it |
---|
441 | would be the beginning or end of a word, but if you need it you can |
---|
442 | always use {{(or bow eow)}}. |
---|
443 | |
---|
444 | Somewhat more generally, Perl introduced positive and negative |
---|
445 | look-ahead and look-behind patterns. Perl look-behind patterns are |
---|
446 | limited to a fixed length; the IrRegex versions, however, have no such |
---|
447 | limit. |
---|
448 | |
---|
449 | (string-search '(: "regular" (look-ahead " expression")) "regular expression") => <match> |
---|
450 | |
---|
451 | The most general case, of course, would be an {{and}} pattern to |
---|
452 | complement the {{or}} pattern - all the patterns must match or the whole |
---|
453 | pattern fails. This may be provided in a future release, although it |
---|
454 | (like look-ahead and look-behind assertions) is unlikely to be |
---|
455 | compiled efficiently. |
---|
456 | |
---|
457 | |
---|
458 | --- |
---|
459 | Previous: [[Unit extras]] |
---|
460 | |
---|
461 | Next: [[Unit srfi-1]] |
---|