Last change on this file since 18110 was 18110, checked in by felix winkelmann, 9 years ago

merged some changes from the core repo

[[tags: manual]]
[[toc:]]

== Unit regex

This library unit provides support for regular expressions. The regular
expression package used is {{irregex}},
written by Alex Shinn. Irregex supports most Perl extensions and is
written completely in Scheme.

This library unit exposes two APIs: the standard Chicken API described below, and the
original irregex API.  You may use either API or both:

 (require-library regex)   ; required for either API, or both
 (import regex)            ; import the Chicken regex API
 (import irregex)          ; import the original irregex API

Regular expressions may be either POSIX-style strings (with most PCRE
extensions) or an SCSH-style SRE. There is no {{(rx ...)}} syntax -
just use normal Scheme lists, with quasiquote if you like.

=== grep

<procedure>(grep REGEX LIST [ACCESSOR])</procedure>

Returns all items of {{LIST}} that match the regular expression
{{REGEX}}.  Ignoring the optional {{ACCESSOR}}, this procedure could be
defined as follows:

<enscript highlight=scheme>
(define (grep regex lst)
  (filter (lambda (x) (string-search regex x)) lst) )
</enscript>

{{ACCESSOR}} is an optional accessor-procedure applied to each
element before doing the match. It should take a single argument
and return a string that will then be used in the regular expression
matching. {{ACCESSOR}} defaults to the identity function.


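As a brief illustration of the optional accessor, the following sketch first filters plain strings and then an association list. The sample data and the {{(id . name)}} record layout are made up for this example.

<enscript highlight=scheme>
(require-library regex)
(import regex)

;; Plain strings: keep the items that contain a digit.
(grep "[0-9]" '("abc" "a1" "42" "xyz"))
;; => ("a1" "42")

;; Hypothetical (id . name) records: match against the name via cdr.
(grep "^b" '((1 . "apple") (2 . "banana") (3 . "blueberry")) cdr)
;; => ((2 . "banana") (3 . "blueberry"))
</enscript>
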
=== glob->regexp

<procedure>(glob->regexp PATTERN [SRE?])</procedure>

Converts the file-pattern {{PATTERN}} into a regular expression.

<enscript highlight=scheme>
(glob->regexp "foo.*")
=> "foo\..*"
</enscript>

{{PATTERN}} should follow "glob" syntax. Allowed wildcards are

 *
 [C...]
 [C1-C2]
 [-C...]
 ?

{{glob->regexp}} returns a regular expression object if the optional
argument {{SRE?}} is false or not given, otherwise the SRE of the
computed regular expression is returned.


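A minimal sketch of one way to use the result: the converted pattern is passed to {{string-match}} to select file names from a list. The file names and the helper {{scheme-file?}} are hypothetical, and {{filter}} is assumed to come from {{srfi-1}}.

<enscript highlight=scheme>
(require-library regex srfi-1)
(import regex srfi-1)

;; Convert the glob once and reuse the result for every name.
(define scheme-file?
  (let ((rx (glob->regexp "*.scm")))
    (lambda (name) (string-match rx name))))

(filter scheme-file? '("main.scm" "README" "util.scm" "notes.txt"))
;; => ("main.scm" "util.scm")
</enscript>
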
=== regexp

<procedure>(regexp STRING [IGNORECASE [IGNORESPACE [UTF8]]])</procedure>

Returns a precompiled regular expression object for {{STRING}}.
The optional arguments {{IGNORECASE}}, {{IGNORESPACE}} and {{UTF8}}
specify whether the regular expression should be matched with case- or whitespace-differences
ignored, or whether the string should be treated as containing UTF-8 encoded
characters, respectively.

Note that code that uses regular expressions heavily should always
use them in precompiled form, which is likely to be much faster than
passing strings to any of the regular-expression routines described
below.


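The following sketch shows precompilation in practice: each expression is compiled once and the resulting object is then reused. The variable names are chosen for this example only.

<enscript highlight=scheme>
(require-library regex)
(import regex)

(define number-rx (regexp "[0-9]+"))     ; compiled once, reused below
(define hello-rx  (regexp "hello" #t))   ; IGNORECASE is true

(regexp? number-rx)                      ; => #t
(regexp? "[0-9]+")                       ; => #f, a plain string
(string-search number-rx "abc 123")      ; => ("123")
(string-match hello-rx "HeLLo")          ; => ("HeLLo")
</enscript>
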
=== regexp?

<procedure>(regexp? X)</procedure>

Returns {{#t}} if {{X}} is a precompiled regular expression,
or {{#f}} otherwise.


=== string-match
=== string-match-positions

<procedure>(string-match REGEXP STRING)</procedure><br>
<procedure>(string-match-positions REGEXP STRING)</procedure>

Matches the regular expression in {{REGEXP}} (a string or a precompiled
regular expression) with
{{STRING}} and returns either {{#f}} if the match failed,
or a list of matching groups, where the first element is the complete
match.  For each matching group the
result-list contains either: {{#f}} for a non-matching but optional
group; a list of the start- and end-position of the match in {{STRING}}
(in the case of {{string-match-positions}}); or the matching
substring (in the case of {{string-match}}). Note that the whole of
{{STRING}} must match. To search for a pattern inside a string, see
{{string-search}} below.
Note also that {{string-match}} is implemented by calling
{{string-search}} with the regular expression wrapped in {{^ ... $}}.
If invoked with a precompiled regular expression argument (by using
{{regexp}}), {{string-match}} is identical to {{string-search}}.


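The shape of the result list can be sketched with a small made-up pattern that has two groups, plus one pattern whose optional group does not take part in the match:

<enscript highlight=scheme>
(require-library regex)
(import regex)

;; Complete match first, then one entry per group.
(string-match "([a-z]+)([0-9]*)" "abc123")
;; => ("abc123" "abc" "123")

;; Same match, reported as (START END) positions.
(string-match-positions "([a-z]+)([0-9]*)" "abc123")
;; => ((0 6) (0 3) (3 6))

;; An optional group that did not match yields #f.
(string-match "a(b)?c" "ac")
;; => ("ac" #f)

;; The pattern given as a string must cover the whole input.
(string-match "abc" "xabcx")
;; => #f
</enscript>
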
=== string-search
=== string-search-positions

<procedure>(string-search REGEXP STRING [START [RANGE]])</procedure><br>
<procedure>(string-search-positions REGEXP STRING [START [RANGE]])</procedure>

Searches for the first match of the regular expression in
{{REGEXP}} within {{STRING}}. The search begins at index {{START}}
(the beginning of the string by default) and can be limited to
{{RANGE}} characters. The result has the same form as that of
{{string-match}} and {{string-match-positions}}, respectively.


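A short sketch of searching inside a string, with and without a start index. The sample string is made up. No result is shown for the last call, which only illustrates how {{RANGE}} is passed.

<enscript highlight=scheme>
(require-library regex)
(import regex)

(define s "abc 123 def 456")

(string-search "[0-9]+" s)             ; => ("123")
(string-search-positions "[0-9]+" s)   ; => ((4 7))

;; Start searching at index 7, just past the first number.
(string-search "[0-9]+" s 7)           ; => ("456")

;; Limit the search to 3 characters, starting at index 0.
(string-search "[0-9]+" s 0 3)
</enscript>
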
=== string-split-fields

<procedure>(string-split-fields REGEXP STRING [MODE [START]])</procedure>

Splits {{STRING}} into a list of fields according to {{MODE}},
where {{MODE}} can be the keyword {{#:infix}} ({{REGEXP}}
matches field separator), the keyword {{#:suffix}} ({{REGEXP}}
matches field terminator) or {{#t}} ({{REGEXP}} matches field),
which is the default.

<enscript highlight=scheme>
(define s "this is a string 1, 2, 3,")

(string-split-fields "[^ ]+" s)

  => ("this" "is" "a" "string" "1," "2," "3,")

(string-split-fields " " s #:infix)

  => ("this" "is" "a" "string" "1," "2," "3,")

(string-split-fields "," s #:suffix)

  => ("this is a string 1" " 2" " 3")
</enscript>


=== string-substitute

<procedure>(string-substitute REGEXP SUBST STRING [MODE])</procedure>

Searches substrings in {{STRING}} that match {{REGEXP}}
and substitutes them with the string {{SUBST}}. The substitution
can contain references to subexpressions in
{{REGEXP}} with the {{\NUM}} notation, where {{NUM}}
refers to the NUMth parenthesized expression. The optional argument
{{MODE}} defaults to 1 and specifies the number of the match to
be substituted. Any non-numeric index specifies that all matches are to
be substituted.

<enscript highlight=scheme>
(string-substitute "([0-9]+) (eggs|chicks)" "\\2 (\\1)" "99 eggs or 99 chicks" 2)
=> "99 eggs or chicks (99)"
</enscript>

Note that a regular expression that matches an empty string will
signal an error.


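To make the effect of {{MODE}} concrete, here is a small sketch that replaces the first, the third, and then all matches in a made-up string:

<enscript highlight=scheme>
(require-library regex)
(import regex)

;; Default MODE is 1: only the first match is replaced.
(string-substitute "[0-9]+" "N" "a1 b22 c333")
;; => "aN b22 c333"

;; A numeric MODE selects which match to replace.
(string-substitute "[0-9]+" "N" "a1 b22 c333" 3)
;; => "a1 b22 cN"

;; Any non-numeric MODE replaces all matches.
(string-substitute "[0-9]+" "N" "a1 b22 c333" #t)
;; => "aN bN cN"
</enscript>
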
=== string-substitute*

<procedure>(string-substitute* STRING SMAP [MODE])</procedure>

Substitutes elements of {{STRING}} with {{string-substitute}} according to {{SMAP}}.
{{SMAP}} should be an association-list where each element of the list
is a pair of the form {{(MATCH . REPLACEMENT)}}. Every occurrence of
the regular expression {{MATCH}} in {{STRING}} will be replaced by the string
{{REPLACEMENT}}.

<enscript highlight=scheme>
(string-substitute* "<h1>Hello, world!</h1>" '(("<[/A-Za-z0-9]+>" . "")))

=>  "Hello, world!"
</enscript>


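A sketch with more than one pair in {{SMAP}}. The words are made up, and the two patterns are chosen so that neither replacement can affect the other:

<enscript highlight=scheme>
(require-library regex)
(import regex)

;; Each (MATCH . REPLACEMENT) pair is applied to every occurrence.
(string-substitute* "red cat, red hat"
                    '(("red" . "blue") ("cat" . "dog")))
;; => "blue dog, blue hat"
</enscript>
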
=== regexp-escape

<procedure>(regexp-escape STRING)</procedure>

Escapes all special characters in {{STRING}} with {{\}}, so that the string can be embedded
into a regular expression.

<enscript highlight=scheme>
(regexp-escape "^[0-9]+:.*$")
=>  "\\^\\[0-9\\]\\+:\\.\\*\\$"
</enscript>

=== Extended SRE Syntax

The following table summarizes the SRE syntax, with detailed explanations following.

  ;; basic patterns
  <string>                          ; literal string
  (seq <sre> ...)                   ; sequence
  (: <sre> ...)
  (or <sre> ...)                    ; alternation
 
  ;; optional/multiple patterns
  (? <sre> ...)                     ; 0 or 1 matches
  (* <sre> ...)                     ; 0 or more matches
  (+ <sre> ...)                     ; 1 or more matches
  (= <n> <sre> ...)                 ; exactly <n> matches
  (>= <n> <sre> ...)                ; <n> or more matches
  (** <from> <to> <sre> ...)        ; <from> to <to> matches
  (?? <sre> ...)                    ; non-greedy pattern: 0 or 1 matches
  (*? <sre> ...)                    ; non-greedy kleene star
  (**? <from> <to> <sre> ...)       ; non-greedy range
 
  ;; submatch patterns
  (submatch <sre> ...)              ; numbered submatch
  (submatch-named <name> <sre> ...) ; named submatch
  (=> <name> <sre> ...)
  (backref <n-or-name>)             ; match a previous submatch
 
  ;; toggling case-sensitivity
  (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
  (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive
 
  ;; character sets
  <char>                            ; singleton char set
  (<string>)                        ; set of chars
  (or <cset-sre> ...)               ; set union
  (~ <cset-sre> ...)                ; set complement (i.e. [^...])
  (- <cset-sre> ...)                ; set difference
  (& <cset-sre> ...)                ; set intersection
  (/ <range-spec> ...)              ; pairs of chars as ranges
 
  ;; named character sets
  any
  nonl
  ascii
  lower-case     lower
  upper-case     upper
  alphabetic     alpha
  numeric        num
  alphanumeric   alphanum  alnum
  punctuation    punct
  graphic        graph
  whitespace     white     space
  printing       print
  control        cntrl
  hex-digit      xdigit
 
  ;; assertions and conditionals
  bos eos                           ; beginning/end of string
  bol eol                           ; beginning/end of line
  bow eow                           ; beginning/end of word
  nwb                               ; non-word-boundary
  (look-ahead <sre> ...)            ; zero-width look-ahead assertion
  (look-behind <sre> ...)           ; zero-width look-behind assertion
  (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
  (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
  (atomic <sre> ...)                ; for (?>...) independent patterns
  (if <test> <pass> [<fail>])       ; conditional patterns
  commit                            ; don't backtrack beyond this (i.e. cut)
 
  ;; backwards compatibility
  (posix-string <string>)           ; embed a POSIX string literal

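The submatch forms listed in the table can be sketched as follows. The {{key=value}} pattern and the name {{kv}} are made up for this example, and it assumes that numbered SRE submatches are reported by {{string-match}} in the same way as parenthesized groups in a string pattern.

<enscript highlight=scheme>
(require-library regex)
(import regex)

;; Hypothetical pattern: letters, "=", digits, each side a submatch.
(define kv '(: (submatch (+ alpha)) "=" (submatch (+ num))))

(string-match kv "width=42")
;; => ("width=42" "width" "42")

(string-match-positions kv "width=42")
;; => ((0 8) (0 5) (6 8))

(string-match kv "width=")
;; => #f
</enscript>
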
====  Basic SRE Patterns

The simplest SRE is a literal string, which matches that string exactly.

  (string-search "needle" "hayneedlehay") => <match>

By default the match is case-sensitive, though you can control this either with the compiler flags or local overrides:

  (string-search "needle" "haynEEdlehay") => #f
 
  (string-search (irregex "needle" 'i) "haynEEdlehay") => <match>
 
  (string-search '(w/nocase "needle") "haynEEdlehay") => <match>

You can use {{w/case}} to switch back to case-sensitivity inside a {{w/nocase}}:

  (string-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => <match>
 
  (string-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f

Of course, literal strings by themselves aren't very interesting
regular expressions, so we want to be able to compose them. The most
basic way to do this is with the {{seq}} operator (or its abbreviation {{:}}),
which matches one or more patterns consecutively:

  (string-search '(: "one" space "two" space "three") "one two three") => <match>

As you may have noticed above, the {{w/case}} and {{w/nocase}} operators
allowed multiple SREs in a sequence - other operators that take any
number of arguments (e.g. the repetition operators below) allow such
implicit sequences.

To match any one of a set of patterns, use the {{or}} alternation operator:

  (string-search '(or "eeney" "meeney" "miney") "meeney") => <match>

  (string-search '(or "eeney" "meeney" "miney") "moe") => #f

====  SRE Repetition Patterns

There are also several ways to control the number of times a pattern
is matched. The simplest of these is {{?}} which just optionally matches
the pattern:

  (string-search '(: "match" (? "es") "!") "matches!") => <match>
 
  (string-search '(: "match" (? "es") "!") "match!") => <match>
 
  (string-search '(: "match" (? "es") "!") "matche!") => #f

To optionally match any number of times, use {{*}}, the Kleene star:

  (string-search '(: "<" (* (~ #\>)) ">") "<html>") => <match>
 
  (string-search '(: "<" (* (~ #\>)) ">") "<>") => <match>
 
  (string-search '(: "<" (* (~ #\>)) ">") "<html") => #f

Often you want to match any number of times, but at least one time is required, and for that you use {{+}}:

  (string-search '(: "<" (+ (~ #\>)) ">") "<html>") => <match>
 
  (string-search '(: "<" (+ (~ #\>)) ">") "<a>") => <match>
 
  (string-search '(: "<" (+ (~ #\>)) ">") "<>") => #f

More generally, to match at least a given number of times, use {{>=}}:

  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => <match>

  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => <match>

  (string-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f

To match a specific number of times exactly, use {{=}}:

  (string-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => <match>
 
  (string-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f

And finally, the most general form is {{**}} which specifies a range
of times to match. All of the earlier forms are special cases of this.

  (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => <match>

  (string-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f

There are also so-called "non-greedy" variants of these repetition
operators, by convention suffixed with an additional {{?}}. Since the
normal repetition patterns can match any of the allotted repetition
range, these operators will match a string if and only if the normal
versions matched. However, when the endpoints of the submatches are
taken into account (which is always the case with {{string-search}},
since the endpoints of the match itself matter),
the use of a non-greedy repetition can change the result.

So, whereas {{?}} can be thought to mean "match or don't match," {{??}} means
"don't match or match." {{*}} typically consumes as much as possible, but
{{*?}} tries first to match zero times, and only consumes one at a time if
that fails. If you have a greedy operator followed by a non-greedy
operator in the same pattern, they can produce surprising results as
they compete to make the match longer or shorter. If this seems
confusing, that's because it is. Non-greedy repetitions are defined
only in terms of the specific backtracking algorithm used to implement
them, which for compatibility purposes always means the Perl
algorithm. Thus, when using these patterns you force IrRegex to use a
backtracking engine, and can't rely on efficient execution.

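The difference between greedy and non-greedy repetition can be sketched with a made-up input containing two tags. Both patterns find a match, but the matched substring differs:

<enscript highlight=scheme>
(require-library regex)
(import regex)

;; Greedy: (* any) consumes as much as possible.
(string-search '(: "<" (* any) ">") "<a><b>")
;; => ("<a><b>")

;; Non-greedy: (*? any) stops at the first ">" that works.
(string-search '(: "<" (*? any) ">") "<a><b>")
;; => ("<a>")

;; The same non-greedy pattern written as a POSIX-style string.
(string-search "<.*?>" "<a><b>")
;; => ("<a>")
</enscript>
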
====  SRE Character Sets

Perhaps more common than matching specific strings is matching any of
a set of characters. You can use the {{or}} alternation pattern on a list
of single-character strings to simulate a character set, but this is
too clumsy for everyday use, so SRE syntax allows a number of
shortcuts.

A single character matches that character literally, a trivial
character class. More conveniently, a list holding a single element
which is a string refers to the character set composed of every
character in the string.

  (string-match '(* #\-) "---") => <match>
 
  (string-match '(* #\-) "-_-") => #f
 
  (string-match '(* ("aeiou")) "oui") => <match>
 
  (string-match '(* ("aeiou")) "ouais") => #f

Ranges are introduced with the {{/}} operator. Any strings or characters
in the {{/}} are flattened and then taken in pairs to represent the start
and end points, inclusive, of character ranges.

  (string-match '(* (/ "AZ09")) "R2D2") => <match>
 
  (string-match '(* (/ "AZ09")) "C-3PO") => #f

In addition, a number of set algebra operations are provided. {{or}}, of
course, has the same meaning, but when all the options are character
sets it can be thought of as the set union operator. This is further
extended by the {{&}} set intersection, {{-}} set difference, and {{~}} set
complement operators.

  (string-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => <match>
 
  (string-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f

  (string-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => <match>
 
  (string-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f

====  SRE Assertion Patterns

There are a number of times it can be useful to assert something about
the area around a pattern without explicitly making it part of the
pattern. The most common cases are specifically anchoring some pattern
to the beginning or end of a word or line or even the whole
string. For example, to match on the end of a word:

  (string-search '(: "foo" eow) "foo") => <match>
 
  (string-search '(: "foo" eow) "foo!") => <match>
 
  (string-search '(: "foo" eow) "foof") => #f

The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} work similarly. {{nwb}} asserts that you
are not in a word-boundary - if substituted for {{eow}} in the above examples
it would reverse all the results.

There is no {{wb}}, since you tend to know from context whether it
would be the beginning or end of a word, but if you need it you can
always use {{(or bow eow)}}.

Somewhat more generally, Perl introduced positive and negative
look-ahead and look-behind patterns. Perl look-behind patterns are
limited to a fixed length; the IrRegex versions, however, have no such
limit.

  (string-search '(: "regular" (look-ahead " expression")) "regular expression") => <match>

The most general case, of course, would be an {{and}} pattern to
complement the {{or}} pattern - all the patterns must match or the whole
pattern fails. This may be provided in a future release, although it,
like look-ahead and look-behind assertions, is unlikely to be
compiled efficiently.


---
Previous: [[Unit extras]]

Next: [[Unit srfi-1]]