source: project/chicken/branches/release/manual/Unit regex @ 7276

Last change on this file since 7276 was 7276, checked in by felix winkelmann, 12 years ago

merged trunk

[[tags: manual]]
[[toc:]]

== Unit regex

This library unit provides support for regular expressions. The regular
expression package used is {{PCRE}} (''Perl Compatible Regular Expressions''),
written by Philip Hazel. See [[http://www.pcre.org]] for information about
the particular regexp flavor and extensions provided by this library.

To test that PCRE support has been built into Chicken properly, try:

<enscript highlight=scheme>
(require 'regex)
(test-feature? 'pcre) => #t
</enscript>


=== grep

 [procedure] (grep REGEX LIST)

Returns all items of {{LIST}} that match the regular expression
{{REGEX}}.  This procedure could be defined as follows:

<enscript highlight=scheme>
(define (grep regex lst)
  (filter (lambda (x) (string-search regex x)) lst) )
</enscript>
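
As a short usage sketch (the list of strings is made up for illustration),
{{grep}} keeps every item in which the pattern is found:

<enscript highlight=scheme>
(require 'regex)

;; Hypothetical sample data; any list of strings works.
(define fruits '("apple" "banana" "avocado"))

;; Keep only the strings that start with "a".
(grep "^a" fruits)   ; => ("apple" "avocado")
</enscript>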


=== glob->regexp

 [procedure] (glob->regexp PATTERN)

Converts the file-pattern {{PATTERN}} into a regular expression.

<enscript highlight=scheme>
(glob->regexp "foo.*")
=> "foo\\..*"
</enscript>

{{PATTERN}} should follow "glob" syntax. Allowed wildcards are

 *
 [C...]
 [C1-C2]
 [-C...]
 ?


=== glob?

 [procedure] (glob? STRING)

Returns true if {{STRING}} contains any "glob" wildcards, and {{#f}} otherwise.

A string without any "glob" wildcards does not meet the criteria,
even though it technically is a valid "glob" file-pattern.
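
A minimal sketch of the distinction, using made-up file names:

<enscript highlight=scheme>
(require 'regex)

;; Hypothetical file patterns, for illustration only.
(glob? "*.scm")      ; true: the pattern contains the wildcard *
(glob? "main.scm")   ; => #f: a valid file-pattern, but without wildcards
</enscript>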


=== regexp

 [procedure] (regexp STRING [IGNORECASE [IGNORESPACE [UTF8]]])

Returns a precompiled regular expression object for {{STRING}}.
The optional arguments {{IGNORECASE}}, {{IGNORESPACE}} and {{UTF8}}
specify, respectively, whether case differences are ignored when matching,
whether whitespace differences are ignored, and whether the string should be
treated as containing UTF-8 encoded characters.
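
The following sketch compiles a pattern once with {{IGNORECASE}} set and
reuses it for several searches. The pattern and the subject strings are
invented for the example:

<enscript highlight=scheme>
(require 'regex)

;; Compile once with IGNORECASE set to #t, then reuse the object.
(define rx (regexp "hello" #t))

(string-search rx "Say HELLO")   ; => ("HELLO")
(string-search rx "goodbye")     ; => #f
</enscript>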


=== regexp*

 [procedure] (regexp* STRING [OPTIONS [TABLES]])

Returns a precompiled regular expression object for {{STRING}}. The optional
argument {{OPTIONS}} must be a list of option symbols. The optional argument
{{TABLES}} must be a character definitions table (not defined here).


Option Symbols:

; caseless : Character case insensitive match
; multiline : Equivalent to Perl's /m option
; dotall : Equivalent to Perl's /s option
; extended : Ignore whitespace
; anchored : Anchor pattern match
; dollar-endonly : `$' metacharacter in the pattern matches only at the end of the subject string
; extra : Currently of very little use
; notbol : First character of the string is not the beginning of a line
; noteol : End of the string is not the end of a line
; ungreedy : Inverts the "greediness" of the quantifiers so that they are not greedy by default
; notempty : The empty string is not considered to be a valid match
; utf8 : UTF-8 encoded characters
; no-auto-capture : Disables the use of numbered capturing parentheses
; no-utf8-check : Skip the check for valid UTF-8 sequences
; auto-callout : Automatically inserts callout items (not defined here)
; partial : Partial match ok
; firstline : An unanchored pattern is required to match before or at the first newline
; dupnames : Names used to identify capturing subpatterns need not be unique
; newline-cr : Newline definition is `\r'
; newline-lf : Newline definition is `\n'
; newline-crlf : Newline definition is `\r\n'
; newline-anycrlf : Newline definition is any of `\r', `\n', or `\r\n'
; newline-any : Newline definition is any Unicode newline sequence
; bsr-anycrlf : `\R' escape sequence matches only CR, LF, or CRLF
; bsr-unicode : `\R' escape sequence matches any Unicode newline sequence

; dfa-shortest : Currently unused
; dfa-restart : Currently unused
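
As a sketch of how the option list is passed, the following compiles a
pattern with two of the symbols listed above. The pattern and the subject
string are made up, and no result is shown because newline handling depends
on how PCRE was built:

<enscript highlight=scheme>
(require 'regex)

;; Case-insensitive pattern in which ^ may also match after a newline.
(define rx (regexp* "^b" '(caseless multiline)))

;; Search a two-line subject string with the compiled pattern.
(string-search rx "a\nB")
</enscript>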


=== regexp?

 [procedure] (regexp? X)

Returns {{#t}} if {{X}} is a precompiled regular expression,
or {{#f}} otherwise.
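
For instance, a pattern string is not itself a precompiled regular
expression. A small sketch with an arbitrary pattern:

<enscript highlight=scheme>
(require 'regex)

(regexp? (regexp "a+"))   ; => #t
(regexp? "a+")            ; => #f: a plain string has not been compiled
</enscript>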


=== regexp-optimize

 [procedure] (regexp-optimize RX)

Performs available optimizations for the precompiled regular expression {{RX}}.
Returns {{#t}} when an optimization was performed, and {{#f}} otherwise.


=== string-match
=== string-match-positions

 [procedure] (string-match REGEXP STRING [START])
 [procedure] (string-match-positions REGEXP STRING [START])

Matches the regular expression in {{REGEXP}} (a string or a precompiled
regular expression) against {{STRING}} and returns {{#f}} if the match
failed, or otherwise a list of matching groups, where the first element is
the complete match. If the optional argument {{START}} is supplied, it
specifies the starting position in {{STRING}}.  For each matching group the
result list contains either: {{#f}} for a non-matching but optional
group; a list of the start and end positions of the match in {{STRING}}
(in the case of {{string-match-positions}}); or the matching
substring (in the case of {{string-match}}). Note that the whole string
must match. To search for a pattern inside a string, see {{string-search}} below.
Note also that {{string-match}} is implemented by calling
{{string-search}} with the regular expression wrapped in {{^ ... $}}.
If invoked with a precompiled regular expression argument (by using
{{regexp}}), {{string-match}} is identical to {{string-search}}.
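
The result shapes described above can be sketched as follows. The patterns
and subject strings are invented for illustration, and string patterns are
used so that the whole string has to match:

<enscript highlight=scheme>
(require 'regex)

;; Complete match first, then one entry per parenthesized group.
(string-match "([a-z]+)-([0-9]+)" "abc-123")
; => ("abc-123" "abc" "123")

;; Same match, reported as (START END) positions.
(string-match-positions "([a-z]+)-([0-9]+)" "abc-123")
; => ((0 7) (0 3) (4 7))

;; An optional group that did not take part in the match yields #f.
(string-match "(a)?b" "b")
; => ("b" #f)

;; The pattern matches only part of the string, so the match fails.
(string-match "[0-9]+" "abc-123")
; => #f
</enscript>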


=== string-search
=== string-search-positions

 [procedure] (string-search REGEXP STRING [START [RANGE]])
 [procedure] (string-search-positions REGEXP STRING [START [RANGE]])

Searches for the first match of the regular expression in
{{REGEXP}} within {{STRING}}. If the optional argument {{START}} is
supplied, the search begins at that position in {{STRING}}. The search
can be limited to {{RANGE}} characters. The result has the same form as
that of {{string-match}} and {{string-match-positions}}, respectively.
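
Unlike {{string-match}}, a search succeeds when the pattern is found anywhere
in the string. A brief sketch with made-up input:

<enscript highlight=scheme>
(require 'regex)

;; The digits are found inside the string.
(string-search "[0-9]+" "abc-123")             ; => ("123")
(string-search-positions "[0-9]+" "abc-123")   ; => ((4 7))

;; No digits anywhere, so the search fails.
(string-search "[0-9]+" "abc")                 ; => #f
</enscript>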


=== string-split-fields

 [procedure] (string-split-fields REGEXP STRING [MODE [START]])

Splits {{STRING}} into a list of fields according to {{MODE}},
where {{MODE}} can be the keyword {{#:infix}} ({{REGEXP}}
matches field separator), the keyword {{#:suffix}} ({{REGEXP}}
matches field terminator) or {{#t}} ({{REGEXP}} matches field),
which is the default.

<enscript highlight=scheme>
(define s "this is a string 1, 2, 3,")

(string-split-fields "[^ ]+" s)

  => ("this" "is" "a" "string" "1," "2," "3,")

(string-split-fields " " s #:infix)

  => ("this" "is" "a" "string" "1," "2," "3,")

(string-split-fields "," s #:suffix)

  => ("this is a string 1" " 2" " 3")
</enscript>


=== string-substitute

 [procedure] (string-substitute REGEXP SUBST STRING [MODE])

Searches substrings in {{STRING}} that match {{REGEXP}}
and substitutes them with the string {{SUBST}}. The substitution
can contain references to subexpressions in
{{REGEXP}} with the {{\NUM}} notation, where {{NUM}}
refers to the NUMth parenthesized expression. The optional argument
{{MODE}} defaults to 1 and specifies the number of the match to
be substituted. Any non-numeric index specifies that all matches are to
be substituted.

<enscript highlight=scheme>
(string-substitute "([0-9]+) (eggs|chicks)"
                   "\\2 (\\1)" "99 eggs or 99 chicks" 2)
=> "99 eggs or chicks (99)"
</enscript>

Note that a regular expression that matches an empty string will
signal an error.
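
The effect of {{MODE}} can be sketched by comparing the default with a
non-numeric value. The input string is made up, and {{#t}} is just one
possible non-numeric value:

<enscript highlight=scheme>
(require 'regex)

;; Default MODE of 1: only the first match is replaced.
(string-substitute "[0-9]+" "N" "1 and 22 and 333")
; => "N and 22 and 333"

;; Non-numeric MODE: every match is replaced.
(string-substitute "[0-9]+" "N" "1 and 22 and 333" #t)
; => "N and N and N"
</enscript>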


=== string-substitute*

 [procedure] (string-substitute* STRING SMAP [MODE])

Substitutes elements of {{STRING}} with {{string-substitute}} according to {{SMAP}}.
{{SMAP}} should be an association-list where each element of the list
is a pair of the form {{(MATCH . REPLACEMENT)}}. Every occurrence of
the regular expression {{MATCH}} in {{STRING}} will be replaced by the string
{{REPLACEMENT}}.

<enscript highlight=scheme>
(string-substitute* "<h1>Hello, world!</h1>"
                    '(("<[/A-Za-z0-9]+>" . "")))

=>  "Hello, world!"
</enscript>


=== regexp-escape

 [procedure] (regexp-escape STRING)

Escapes all special characters in {{STRING}} with {{\}}, so that the string can be embedded
into a regular expression.

<enscript highlight=scheme>
(regexp-escape "^[0-9]+:.*$")
=>  "\\^\\[0-9\\]\\+:\\.\\*\\$"
</enscript>
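
One typical use is searching for literal text that happens to contain
metacharacters. In this sketch the name {{needle}} and both strings are
hypothetical:

<enscript highlight=scheme>
(require 'regex)

;; Hypothetical literal text containing the metacharacter +.
(define needle "1+1")

;; Escaped, the text is searched for literally.
(string-search (regexp-escape needle) "what is 1+1?")   ; => ("1+1")

;; Unescaped, + is a quantifier, so the pattern needs two adjacent 1s.
(string-search needle "what is 1+1?")                   ; => #f
</enscript>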


=== make-anchored-pattern

 [procedure] (make-anchored-pattern REGEXP [WITHOUT-BOL [WITHOUT-EOL]])

Makes an anchored pattern from {{REGEXP}} (a string or a precompiled regular
expression) and returns the updated pattern. When {{WITHOUT-BOL}} is {{#t}} the
beginning-of-line anchor is not added. When {{WITHOUT-EOL}} is {{#t}} the
end-of-line anchor is not added.

The {{WITHOUT-BOL}} and {{WITHOUT-EOL}} arguments are ignored for a precompiled regular
expression.
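
A sketch of the two optional arguments with a string pattern follows. The
names {{both-ends}} and {{end-only}} are chosen for the example, and the
results assume the anchors behave as described above:

<enscript highlight=scheme>
(require 'regex)

;; Anchored at both ends, and anchored at the end only.
(define both-ends (make-anchored-pattern "[0-9]+"))
(define end-only  (make-anchored-pattern "[0-9]+" #t))

;; The subject does not consist of digits only, so no match.
(string-search both-ends "abc123")   ; => #f

;; Without the beginning-of-line anchor, the trailing digits match.
(string-search end-only "abc123")    ; => ("123")
</enscript>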


Previous: [[Unit match]]

Next: [[Unit srfi-18]]