source: project/wiki/man/5/Module (chicken irregex) @ 39536

Last change on this file since 39536 was 39536, checked in by Idiomdrottning, 5 months ago

[[tags: manual]]
[[toc:]]

== Module (chicken irregex)

This module provides support for regular expressions, using the
powerful ''irregex'' regular expression engine by Alex Shinn.  It
supports POSIX syntax with various (irregular) PCRE extensions,
as well as SCSH's SRE syntax, with various aliases for commonly used
patterns.  DFA matching is used when possible; otherwise a
closure-compiled NFA approach is used.  Matching may be performed over
standard Scheme strings, or over arbitrarily chunked streams of
strings.

On systems that support dynamic loading, this module can be
made available in the CHICKEN interpreter ({{csi}}) by entering

<enscript highlight=scheme>
(import (chicken irregex))
</enscript>

=== Procedures

==== irregex
==== string->irregex
==== sre->irregex

<procedure>(irregex <posix-string-or-sre> [<options> ...])</procedure><br>
<procedure>(string->irregex <posix-string> [<options> ...])</procedure><br>
<procedure>(sre->irregex <sre> [<options> ...])</procedure><br>

Compiles a regular expression from either a POSIX-style regular
expression string (with most PCRE extensions) or an SCSH-style SRE.
There is no {{(rx ...)}} syntax - just use normal Scheme lists, with
{{quasiquote}} if you like.

Technically a string by itself could be considered a valid (though
rather silly) SRE, so if you want to just match a literal string you
should use something like {{(irregex `(: ,str))}}, or use the explicit
{{(sre->irregex str)}}.
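
The difference between the two readings of a string can be sketched as
follows.  This is a minimal illustration, assuming the module is
imported as {{(chicken irregex)}}; the variable {{str}} and its
contents are only examples.

<enscript highlight=scheme>
(import (chicken irregex))

(define str "a.c")

;; Read as a POSIX string, "." is a wildcard.
(irregex-match? (irregex str) "abc")          ; => #t

;; Read as an SRE, the same string is a literal.
(irregex-match? (sre->irregex str) "abc")     ; => #f
(irregex-match? (sre->irregex str) "a.c")     ; => #t
(irregex-match? (irregex `(: ,str)) "a.c")    ; => #t
</enscript>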

The options are a list of any of the following symbols:

; {{'i}}, {{'case-insensitive}} : match case-insensitively
; {{'m}}, {{'multi-line}} : treat string as multiple lines (affects {{^}} and {{$}})
; {{'s}}, {{'single-line}} : treat string as a single line ({{.}} can match newline)
; {{'utf8}} : utf8-mode (assumes strings are byte-strings)
; {{'fast}} : try to optimize the regular expression
; {{'small}} : try to compile a smaller regular expression
; {{'backtrack}} : enforce a backtracking implementation

The {{'fast}} and {{'small}} options are heuristic guidelines and will
not necessarily make the compiled expression faster or smaller.
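
As a rough sketch of how the {{'s}} and {{'m}} options change
matching (the patterns and inputs here are made up for illustration):

<enscript highlight=scheme>
(import (chicken irregex))

;; 's lets "." match a newline.
(irregex-match? (irregex "a.b") "a\nb")        ; => #f
(irregex-match? (irregex "a.b" 's) "a\nb")     ; => #t

;; 'm lets "^" match at the start of any line, not just the string.
(irregex-search (irregex "^b") "a\nb")         ; => #f
(irregex-search (irregex "^b" 'm) "a\nb")      ; => #<match>

;; Several options may be given together.
(irregex-match? (irregex "a.b" 'i 's) "A\nB")  ; => #t
</enscript>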

==== string->sre
==== maybe-string->sre

<procedure>(string->sre <str>)</procedure><br>
<procedure>(maybe-string->sre <obj>)</procedure><br>

{{string->sre}} converts a POSIX string into an SRE.  It is provided
for backwards compatibility.

{{maybe-string->sre}} does the same thing, but only if the argument is
a string, otherwise it assumes {{<obj>}} is an SRE and returns it
as-is.  This is useful when you want to provide an API that allows
either a POSIX string or SRE (like {{irregex}} or {{irregex-search}}
below) - it ensures the result is an SRE.
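
For instance, a wrapper that accepts either kind of pattern might look
like the sketch below.  The helper name {{whole-string-irregex}} is
hypothetical and not part of the module.

<enscript highlight=scheme>
(import (chicken irregex))

;; Hypothetical helper: accept a POSIX string or an SRE and
;; anchor it to the whole input.
(define (whole-string-irregex pat)
  (irregex `(: bos ,(maybe-string->sre pat) eos)))

(irregex-search (whole-string-irregex "ab+") "abbb")         ; => #<match>
(irregex-search (whole-string-irregex '(+ numeric)) "2024")  ; => #<match>
(irregex-search (whole-string-irregex "ab+") "xabbb")        ; => #f
</enscript>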

==== glob->sre

<procedure>(glob->sre <str>)</procedure>

Converts a basic shell-style glob to an SRE which matches only strings
which the glob would match.  The glob characters {{[}}, {{]}}, {{*}}
and {{?}} are supported.
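
A short sketch of typical use; the glob and file names are
illustrative only:

<enscript highlight=scheme>
(import (chicken irregex))

(define scheme-file (irregex (glob->sre "*.scm")))

(irregex-match? scheme-file "irregex.scm")  ; => #t
(irregex-match? scheme-file "irregex.txt")  ; => #f
</enscript>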

==== irregex?

<procedure>(irregex? <obj>)</procedure>

Returns {{#t}} iff the object is a regular expression.

==== irregex-search

<procedure>(irregex-search <irx> <str> [<start> <end>])</procedure>

Searches for any instances of the pattern {{<irx>}} (a POSIX string, SRE
sexp, or pre-compiled regular expression) in {{<str>}}, optionally between
the given range.  If a match is found, returns a match object,
otherwise returns {{#f}}.

Match objects can be used to query the original range of the string or
its submatches using the {{irregex-match-*}} procedures below.

Examples:

<enscript highlight=scheme>
(irregex-search "foobar" "abcFOOBARdef") => #f

(irregex-search (irregex "foobar" 'i) "abcFOOBARdef") => #<match>

(irregex-search '(w/nocase "foobar") "abcFOOBARdef") => #<match>
</enscript>

Note, the actual match result is represented by a vector in the
default implementation.  Throughout this manual, we'll just write
{{#<match>}} to show that a successful match was returned when the
details are not important.

Matching follows the POSIX leftmost, longest semantics, when
searching.  That is, of all possible matches in the string,
{{irregex-search}} will return the match at the first position
(leftmost).  If multiple matches are possible from that same first
position, the longest match is returned.
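
A small illustration of this rule, using a made-up input: the search
returns {{"12"}} rather than {{"345"}} (leftmost) and rather than
{{"1"}} (longest from that position).

<enscript highlight=scheme>
(import (chicken irregex))

(define m (irregex-search '(+ numeric) "ab12cd345"))

(irregex-match-substring m)    ; => "12"
(irregex-match-start-index m)  ; => 2
(irregex-match-end-index m)    ; => 4
</enscript>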

==== irregex-match
==== irregex-match?

<procedure>(irregex-match <irx> <str> [<start> <end>])</procedure><br>
<procedure>(irregex-match? <irx> <str> [<start> <end>])</procedure>

Like {{irregex-search}}, but performs an anchored match against the
beginning and end of the substring specified by {{<start>}} and
{{<end>}}, without searching.

Where {{irregex-match}} returns a match object, {{irregex-match?}}
just returns a boolean indicating whether it matched or not.

Examples:

<enscript highlight=scheme>
(irregex-match '(w/nocase "foobar") "abcFOOBARdef") => #f

(irregex-match '(w/nocase "foobar") "FOOBAR") => #<match>
</enscript>

==== irregex-match-data?

<procedure>(irregex-match-data? <obj>)</procedure>

Returns {{#t}} iff the object is a successful match result from
{{irregex-search}} or {{irregex-match}}.

==== irregex-num-submatches
==== irregex-match-num-submatches

<procedure>(irregex-num-submatches <irx>)</procedure><br>
<procedure>(irregex-match-num-submatches <match>)</procedure>

Returns the number of numbered submatches that are defined in the
irregex or match object.

==== irregex-names
==== irregex-match-names

<procedure>(irregex-names <irx>)</procedure><br>
<procedure>(irregex-match-names <match>)</procedure>

Returns an association list of named submatches that are defined in
the irregex or match object.  The {{car}} of each item in this list is
the name of a submatch, the {{cdr}} of each item is the numerical
submatch corresponding to this name.  If a named submatch occurs
multiple times in the irregex, it will also occur multiple times in
this list.
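
A brief sketch of these accessors.  The pattern {{date-rx}} is an
invented example; the order of the entries in the association list is
not relied upon here.

<enscript highlight=scheme>
(import (chicken irregex))

;; Illustrative pattern with two named submatches.
(define date-rx
  (irregex '(: (=> year (= 4 numeric)) "-" (=> month (= 2 numeric)))))

(irregex-num-submatches date-rx)             ; => 2
(cdr (assq 'year (irregex-names date-rx)))   ; => 1
(cdr (assq 'month (irregex-names date-rx)))  ; => 2
</enscript>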

==== irregex-match-valid-index?

<procedure>(irregex-match-valid-index? <match> <index-or-name>)</procedure><br>

Returns {{#t}} iff the {{index-or-name}} named submatch or index is
defined in the {{match}} object.

==== irregex-match-substring
==== irregex-match-start-index
==== irregex-match-end-index

<procedure>(irregex-match-substring <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-start-index <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-end-index <match> [<index-or-name>])</procedure>

Fetches the matched substring (or its start or end offset) at the
given submatch index, or named submatch.  The entire match is index 0,
the first submatch is index 1, etc.  The default is index 0.

Returns {{#f}} if the given submatch did not match the source string (can happen when you have the submatch inside an {{or}} alternative, for example).
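
The following sketch shows numeric and named access side by side, and
the {{#f}} result for a submatch in an unused {{or}} branch.  The
pattern and input strings are invented for illustration.

<enscript highlight=scheme>
(import (chicken irregex))

(define m
  (irregex-search '(: (=> year (= 4 numeric)) "-" (=> month (= 2 numeric)))
                  "released 2021-07"))

(irregex-match-substring m)            ; => "2021-07"
(irregex-match-substring m 'year)      ; => "2021"
(irregex-match-substring m 2)          ; => "07"
(irregex-match-start-index m 'month)   ; => 14
(irregex-match-end-index m)            ; => 16

;; A submatch inside an alternative that was not taken yields #f.
(define m2 (irregex-search '(or ($ "a") ($ "b")) "b"))
(irregex-match-substring m2 1)         ; => #f
(irregex-match-substring m2 2)         ; => "b"
</enscript>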

==== irregex-match-subchunk
==== irregex-match-start-chunk
==== irregex-match-end-chunk

<procedure>(irregex-match-subchunk <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-start-chunk <match> [<index-or-name>])</procedure><br>
<procedure>(irregex-match-end-chunk <match> [<index-or-name>])</procedure>

Access the chunks delimiting the submatch index, or named submatch.

{{irregex-match-subchunk}} generates a chunked data-type for the given
match item, of the same type as the underlying chunk type (see Chunked
String Matching below).  This is only available if the chunk type
specifies the get-subchunk API, otherwise an error is raised.

Returns {{#f}} if the given submatch did not match the source string (can happen when you have the submatch inside an {{or}} alternative, for example).

==== irregex-replace
==== irregex-replace/all

<procedure>(irregex-replace <irx> <str> [<replacements> ...])</procedure><br>
<procedure>(irregex-replace/all <irx> <str> [<replacements> ...])</procedure>

Matches a pattern in a string, and replaces it with a (possibly empty)
list of substitutions.  Each {{<replacement>}} can be either a string
literal, a numeric index, a symbol (as a named submatch), or a
procedure which takes one argument (the match object) and returns a
string.

Examples:

<enscript highlight=scheme>
(irregex-replace "[aeiou]" "hello world" "*") => "h*llo world"

(irregex-replace/all "[aeiou]" "hello world" "*") => "h*ll* w*rld"

(irregex-replace/all '(* "foo ") "foo foo platter" "*") => "**p*l*a*t*t*e*r"

(irregex-replace "(.)(.)" "ab" 2 1 "*")  => "ba*"

(irregex-replace "...bar" "xxfoobar" (lambda (m)
              (string-reverse (irregex-match-substring m)))) => "xxraboof"

(irregex-replace "(...)(bar)" "xxfoobar"  2 (lambda (m)
              (string-reverse (irregex-match-substring m 1)))) => "xxbaroof"
</enscript>
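
Symbols referring to named submatches can be used in the same way as
numeric indices.  A minimal sketch, with invented submatch names
{{first}} and {{last}}:

<enscript highlight=scheme>
(import (chicken irregex))

(define swapped
  (irregex-replace '(: (=> first (+ alpha)) " " (=> last (+ alpha)))
                   "Ada Lovelace"
                   'last ", " 'first))

swapped  ; => "Lovelace, Ada"
</enscript>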

==== irregex-split
==== irregex-extract

<procedure>(irregex-split <irx> <str> [<start> <end>])</procedure><br>
<procedure>(irregex-extract <irx> <str> [<start> <end>])</procedure>

{{irregex-split}} splits the string {{<str>}} into substrings divided
by the pattern in {{<irx>}}.  {{irregex-extract}} does the opposite,
returning a list of each instance of the pattern matched disregarding
the substrings in between.

Empty matches will result in subsequent single-character strings in
{{irregex-split}}, or empty strings in {{irregex-extract}}.

<enscript highlight="scheme">
(irregex-split "[aeiou]*" "foobarbaz") => '("f" "b" "r" "b" "z")

(irregex-extract "[aeiou]*" "foobarbaz") => '("" "oo" "" "a" "" "" "a" "")
</enscript>
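
With patterns that cannot match the empty string the results are more
straightforward.  An illustrative sketch with made-up inputs:

<enscript highlight=scheme>
(import (chicken irregex))

(define words (irregex-split '(+ space) "alpha beta  gamma"))
(define numbers (irregex-extract '(+ numeric) "a1b22c333"))

words    ; => ("alpha" "beta" "gamma")
numbers  ; => ("1" "22" "333")
</enscript>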

==== irregex-fold

<procedure>(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])</procedure>

This performs a fold operation over every non-overlapping place
{{<irx>}} occurs in the string {{<str>}}.

The {{<kons>}} procedure takes the following signature:

<enscript highlight=scheme>
(<kons> <from-index> <match> <seed>)
</enscript>

where {{<from-index>}} is the index from where we started searching
(initially {{<start>}} and thereafter the end index of the last
match), {{<match>}} is the resulting match-data object, and {{<seed>}}
is the accumulated fold result starting with {{<knil>}}.

The {{<from-index>}} (which is not provided in the SCSH
{{regexp-fold}} utility) is passed because this information is useful
(e.g. for extracting the unmatched portion of the string before the
current match, as needed in {{irregex-replace}}), and not otherwise
directly accessible.

The optional {{<finish>}} takes two arguments:

<enscript highlight=scheme>
(<finish> <from-index> <seed>)
</enscript>

which similarly allows you to pick up the unmatched tail of the string,
and defaults to just returning the {{<seed>}}.

{{<start>}} and {{<end>}} are numeric indices letting you specify the
boundaries of the string on which you want to fold.

To extract all instances of a match out of a string, you can use

<enscript highlight=scheme>
(map irregex-match-substring
     (irregex-fold <irx>
                   (lambda (i m s) (cons m s))
                   '()
                   <str>
                   (lambda (i s) (reverse s))))
</enscript>

Note if an empty match is found {{<kons>}} will be called on that
empty string, and to avoid an infinite loop matching will resume at
the next char.  It is up to the programmer to do something sensible
with the skipped char in this case.
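
Two concrete folds, as a sketch under the description above.  The
names {{total}}, {{str}} and {{pieces}} are illustrative; the second
fold uses {{<from-index>}} and {{<finish>}} to collect the text
between matches.

<enscript highlight=scheme>
(import (chicken irregex))

;; Sum every integer found in a string.
(define total
  (irregex-fold '(+ numeric)
                (lambda (i m sum)
                  (+ sum (string->number (irregex-match-substring m))))
                0
                "3 apples and 14 pears"))

total   ; => 17

;; Collect the unmatched text before each match, plus the tail.
(define str "a1b22c")
(define pieces
  (irregex-fold '(+ numeric)
                (lambda (i m acc)
                  (cons (substring str i (irregex-match-start-index m)) acc))
                '()
                str
                (lambda (i acc)
                  (reverse (cons (substring str i (string-length str)) acc)))))

pieces  ; => ("a" "b" "c")
</enscript>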

=== Extended SRE Syntax

Irregex provides the first native implementation of SREs (Scheme
Regular Expressions), and includes many extensions necessary both for
minimal POSIX compatibility and for modern extensions found in
libraries such as PCRE.

The following table summarizes the SRE syntax, with detailed
explanations following.

  ;; basic patterns
  <string>                          ; literal string
  (seq <sre> ...)                   ; sequence
  (: <sre> ...)
  (or <sre> ...)                    ; alternation
 
  ;; optional/multiple patterns
  (? <sre> ...)                     ; 0 or 1 matches
  (* <sre> ...)                     ; 0 or more matches
  (+ <sre> ...)                     ; 1 or more matches
  (= <n> <sre> ...)                 ; exactly <n> matches
  (>= <n> <sre> ...)                ; <n> or more matches
  (** <from> <to> <sre> ...)        ; <from> to <to> matches
  (?? <sre> ...)                    ; non-greedy optional pattern (0 or 1 matches)
  (*? <sre> ...)                    ; non-greedy kleene star
  (**? <from> <to> <sre> ...)       ; non-greedy range
 
  ;; submatch patterns
  (submatch <sre> ...)              ; numbered submatch
  ($ <sre> ...)
  (submatch-named <name> <sre> ...) ; named submatch
  (=> <name> <sre> ...)
  (backref <n-or-name>)             ; match a previous submatch
 
  ;; toggling case-sensitivity
  (w/case <sre> ...)                ; enclosed <sre>s are case-sensitive
  (w/nocase <sre> ...)              ; enclosed <sre>s are case-insensitive
 
  ;; character sets
  <char>                            ; singleton char set
  (<string>)                        ; set of chars
  (or <cset-sre> ...)               ; set union
  (~ <cset-sre> ...)                ; set complement (i.e. [^...])
  (- <cset-sre> ...)                ; set difference
  (& <cset-sre> ...)                ; set intersection
  (/ <range-spec> ...)              ; pairs of chars as ranges
 
  ;; named character sets
  any
  nonl
  ascii
  lower-case     lower
  upper-case     upper
  alphabetic     alpha
  numeric        num
  alphanumeric   alphanum  alnum
  punctuation    punct
  graphic        graph
  whitespace     white     space
  printing       print
  control        cntrl
  hex-digit      xdigit
 
  ;; assertions and conditionals
  bos eos                           ; beginning/end of string
  bol eol                           ; beginning/end of line
  bow eow                           ; beginning/end of word
  nwb                               ; non-word-boundary
  (look-ahead <sre> ...)            ; zero-width look-ahead assertion
  (look-behind <sre> ...)           ; zero-width look-behind assertion
  (neg-look-ahead <sre> ...)        ; zero-width negative look-ahead assertion
  (neg-look-behind <sre> ...)       ; zero-width negative look-behind assertion
  (atomic <sre> ...)                ; for (?>...) independent patterns
  (if <test> <pass> [<fail>])       ; conditional patterns
  commit                            ; don't backtrack beyond this (i.e. cut)
 
  ;; backwards compatibility
  (posix-string <string>)           ; embed a POSIX string literal

==== Basic SRE Patterns

The simplest SRE is a literal string, which matches that string
exactly.

<enscript highlight=scheme>
(irregex-search "needle" "hayneedlehay") => #<match>
</enscript>

By default the match is case-sensitive, though you can control this
either with the compiler flags or local overrides:

<enscript highlight=scheme>
(irregex-search "needle" "haynEEdlehay") => #f

(irregex-search (irregex "needle" 'i) "haynEEdlehay") => #<match>

(irregex-search '(w/nocase "needle") "haynEEdlehay") => #<match>
</enscript>

You can use {{w/case}} to switch back to case-sensitivity inside a
{{w/nocase}} or when the SRE was compiled with {{'i}}:

<enscript highlight=scheme>
(irregex-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<match>

(irregex-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f
</enscript>

''Important:'' characters outside the ASCII range (i.e. non-ASCII
UTF-8 characters) are '''not''' matched case-insensitively!

Of course, literal strings by themselves aren't very interesting
regular expressions, so we want to be able to compose them.  The most
basic way to do this is with the {{seq}} operator (or its abbreviation
{{:}}), which matches one or more patterns consecutively:

<enscript highlight=scheme>
(irregex-search '(: "one" space "two" space "three") "one two three") => #<match>
</enscript>

As you may have noticed above, the {{w/case}} and {{w/nocase}}
operators allowed multiple SREs in a sequence - other operators that
take any number of arguments (e.g. the repetition operators below)
allow such implicit sequences.

To match any one of a set of patterns use the {{or}} alternation
operator:

<enscript highlight=scheme>
(irregex-search '(or "eeney" "meeney" "miney") "meeney") => #<match>

(irregex-search '(or "eeney" "meeney" "miney") "moe") => #f
</enscript>

==== SRE Repetition Patterns

There are also several ways to control the number of times a pattern
is matched.  The simplest of these is {{?}} which just optionally
matches the pattern:

<enscript highlight=scheme>
(irregex-search '(: "match" (? "es") "!") "matches!") => #<match>

(irregex-search '(: "match" (? "es") "!") "match!") => #<match>

(irregex-search '(: "match" (? "es") "!") "matche!") => #f
</enscript>

To optionally match any number of times, use {{*}}, the Kleene star:

<enscript highlight=scheme>
(irregex-search '(: "<" (* (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (* (~ #\>)) ">") "<>") => #<match>

(irregex-search '(: "<" (* (~ #\>)) ">") "<html") => #f
</enscript>

Often you want to match any number of times, but at least one time is
required, and for that you use {{+}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (+ (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (+ (~ #\>)) ">") "<a>") => #<match>

(irregex-search '(: "<" (+ (~ #\>)) ">") "<>") => #f
</enscript>

More generally, to match at least a given number of times, use {{>=}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<table>") => #<match>

(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<pre>") => #<match>

(irregex-search '(: "<" (>= 3 (~ #\>)) ">") "<tr>") => #f
</enscript>

To match a specific number of times exactly, use {{=}}:

<enscript highlight=scheme>
(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<html>") => #<match>

(irregex-search '(: "<" (= 4 (~ #\>)) ">") "<table>") => #f
</enscript>

And finally, the most general form is {{**}} which specifies a range
of times to match.  All of the earlier forms are special cases of this.

<enscript highlight=scheme>
(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<match>

(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f
</enscript>

There are also so-called "non-greedy" variants of these repetition
operators, by convention suffixed with an additional {{?}}.  Since the
normal repetition patterns can match any of the allotted repetition
range, these operators will match a string if and only if the normal
versions matched.  However, when the endpoints of which submatch
matched where are taken into account (specifically, all matches when
using {{irregex-search}} since the endpoints of the match itself matter),
the use of a non-greedy repetition can change the result.

So, whereas {{?}} can be thought to mean "match or don't match,"
{{??}} means "don't match or match."  {{*}} typically consumes as much
as possible, but {{*?}} tries first to match zero times, and only
consumes one at a time if that fails.  If you have a greedy operator
followed by a non-greedy operator in the same pattern, they can
produce surprising results as they compete to make the match longer or
shorter.  If this seems confusing, that's because it is.  Non-greedy
repetitions are defined only in terms of the specific backtracking
algorithm used to implement them, which for compatibility purposes
always means the Perl algorithm.  Thus, when using these patterns you
force IrRegex to use a backtracking engine, and can't rely on
efficient execution.
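
A small sketch of the difference in the reported match; the input
string is invented for illustration:

<enscript highlight=scheme>
(import (chicken irregex))

;; Greedy: consumes as much as possible.
(define greedy
  (irregex-match-substring
   (irregex-search '(: "<" (* any) ">") "<a><b>")))

;; Non-greedy: stops at the first ">".
(define lazy
  (irregex-match-substring
   (irregex-search '(: "<" (*? any) ">") "<a><b>")))

greedy  ; => "<a><b>"
lazy    ; => "<a>"
</enscript>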

==== SRE Character Sets

Perhaps more common than matching specific strings is matching any of
a set of characters.  You can use the {{or}} alternation pattern on a
list of single-character strings to simulate a character set, but this
is too clumsy for everyday use so SRE syntax allows a number of
shortcuts.

A single character matches that character literally, a trivial
character class.  More conveniently, a list holding a single element
which is a string refers to the character set composed of every
character in the string.

<enscript highlight=scheme>
(irregex-match '(* #\-) "---") => #<match>

(irregex-match '(* #\-) "-_-") => #f

(irregex-match '(* ("aeiou")) "oui") => #<match>

(irregex-match '(* ("aeiou")) "ouais") => #f
</enscript>

Ranges are introduced with the {{/}} operator.  Any strings or
characters in the {{/}} are flattened and then taken in pairs to
represent the start and end points, inclusive, of character ranges.

<enscript highlight=scheme>
(irregex-match '(* (/ "AZ09")) "R2D2") => #<match>

(irregex-match '(* (/ "AZ09")) "C-3PO") => #f
</enscript>

In addition, a number of set algebra operations are provided.  {{or}},
of course, has the same meaning, but when all the options are
character sets it can be thought of as the set union operator.  This
is further extended by the {{&}} set intersection, {{-}} set
difference, and {{~}} set complement operators.

<enscript highlight=scheme>
(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<match>

(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f

(irregex-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<match>

(irregex-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f
</enscript>

==== SRE Assertion Patterns

There are a number of times it can be useful to assert something about
the area around a pattern without explicitly making it part of the
pattern.  The most common cases are specifically anchoring some
pattern to the beginning or end of a word or line or even the whole
string.  For example, to match on the end of a word:

<enscript highlight=scheme>
(irregex-search '(: "foo" eow) "foo") => #<match>

(irregex-search '(: "foo" eow) "foo!") => #<match>

(irregex-search '(: "foo" eow) "foof") => #f
</enscript>

The {{bow}}, {{bol}}, {{eol}}, {{bos}} and {{eos}} work similarly.
{{nwb}} asserts that you are not in a word-boundary - if substituted for
{{eow}} in the above examples it would reverse all the results.

There is no {{wb}}, since you tend to know from context whether it
would be the beginning or end of a word, but if you need it you can
always use {{(or bow eow)}}.

Somewhat more generally, Perl introduced positive and negative
look-ahead and look-behind patterns.  Perl look-behind patterns are
limited to a fixed length, however the IrRegex versions have no such
limit.

<enscript highlight=scheme>
(irregex-search '(: "regular" (look-ahead " expression"))
                "regular expression")
 => #<match>
</enscript>
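
The negative form works analogously, and in both cases the asserted
text is not part of the match itself.  A brief sketch with made-up
inputs:

<enscript highlight=scheme>
(import (chicken irregex))

;; "foo" only when it is not followed by "bar".
(define not-bar '(: "foo" (neg-look-ahead "bar")))

(irregex-search not-bar "foobar")  ; => #f
(irregex-search not-bar "foobaz")  ; => #<match>

;; The look-ahead text is checked but not consumed.
(define m
  (irregex-search '(: "regular" (look-ahead " expression"))
                  "regular expression"))

(irregex-match-substring m)        ; => "regular"
</enscript>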

The most general case, of course, would be an {{and}} pattern to
complement the {{or}} pattern - all the patterns must match or the
whole pattern fails.  This may be provided in a future release,
although it (like look-ahead and look-behind assertions) is unlikely
to be compiled efficiently.

==== SRE Utility Patterns

The following utility regular expressions are also provided for common
patterns that people are eternally reinventing.  They are not
necessarily the official patterns matching the RFC definitions of the
given data, because of the way that such patterns tend to be used.
There are three general usages for regexps:

; searching : search for a pattern matching a desired object in a larger text

; validation : determine whether an entire string matches a pattern

; extraction : given a string already known to be valid, extract certain fields from it as submatches

In some cases, but not always, these will overlap.  When they are
different, {{irregex-search}} will naturally always want the searching
version, so IrRegex provides that version.

As an example where these might be different, consider a URL.  If you
want to match all the URLs in some arbitrary text, you probably want
to exclude a period or comma at the tail end of a URL, since it's more
likely being used as punctuation rather than part of the URL, despite
the fact that it would be valid URL syntax.

Another problem with the RFC definitions is the standard itself may
have become irrelevant.  For example, the pattern IrRegex provides for
email addresses doesn't match quoted local parts (e.g.
{{"first last"@domain.com}}) because these are increasingly rare, and
unsupported by enough software that it's better to discourage their use.
Conversely, technically consecutive periods
(e.g. {{first..last@domain.com}}) are not allowed in email addresses, but
most email software does allow this, and in fact such addresses are
quite common in Japan.

The current patterns provided are:

  newline                        ; general newline pattern (crlf, cr, lf)
  integer                        ; an integer
  real                           ; a real number (including scientific)
  string                         ; a "quoted" string
  symbol                         ; an R5RS Scheme symbol
  ipv4-address                   ; a numeric decimal ipv4 address
  ipv6-address                   ; a numeric hexadecimal ipv6 address
  domain                         ; a domain name
  email                          ; an email address
  http-url                       ; a URL beginning with https?://

Because of these issues the exact definitions of these patterns are
subject to change, but will be documented clearly when they are
finalized.  More common patterns are also planned, but as what you
want increases in complexity it's probably better to use a real
parser.
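
Utility patterns are used like any other named SRE, either on their
own or inside a larger pattern.  A minimal sketch using {{integer}};
the input strings are invented:

<enscript highlight=scheme>
(import (chicken irregex))

(irregex-match? 'integer "42")    ; => #t
(irregex-match? 'integer "4x2")   ; => #f

;; Embedded in a larger SRE, with a submatch around it.
(define m
  (irregex-search '(: "port=" ($ integer))
                  "host=example.org port=8080"))

(irregex-match-substring m 1)     ; => "8080"
</enscript>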

=== Supported PCRE Syntax

Since the PCRE syntax is so overwhelmingly complex, it's easier to just
list what we ''don't'' support for now.  Refer to the
[[http://pcre.org/pcre.txt|PCRE documentation]] for details.  You
should be using the SRE syntax anyway!

Unicode character classes ({{\P}}) are not supported, but will be
in an upcoming release.  {{\C}} named characters are not supported.

Callbacks, subroutine patterns and recursive patterns are not
supported.  {{(*FOO)}} patterns are not supported and may never be.

{{\G}} and {{\K}} are not supported.

Octal character escapes are not supported because they are ambiguous
with back-references - just use hex character escapes.

Other than that everything should work, including named submatches,
zero-width assertions, conditional patterns, etc.

In addition, {{\<}} and {{\>}} act as beginning-of-word and end-of-word
marks, respectively, as in Emacs regular expressions.

Also, two escapes are provided to embed SRE patterns inside PCRE
strings, {{"\'<sre>"}} and {{"(*'<sre>)"}}.  For example, to match a
comma-delimited list of integers you could use

<enscript highlight=scheme>
"\\'integer(,\\'integer)*"
</enscript>

and to match a URL in angle brackets you could use

<enscript highlight=scheme>
"<(*'http-url)>"
</enscript>

Note in the second example the enclosing {{"(*'...)"}} syntax is needed
because the Scheme reader would consider the closing {{">"}} as part of
the SRE symbol.

The following chart gives a quick reference from PCRE form to the SRE
equivalent:

  ;; basic syntax
  "^"                     ;; bos (or bol inside (?m: ...))
  "$"                     ;; eos (or eol inside (?m: ...))
  "."                     ;; nonl
  "a?"                    ;; (? a)
  "a*"                    ;; (* a)
  "a+"                    ;; (+ a)
  "a??"                   ;; (?? a)
  "a*?"                   ;; (*? a)
  "a+?"                   ;; (+? a)
  "a{n,m}"                ;; (** n m a)

  ;; grouping
  "(...)"                 ;; (submatch ...)
  "(?:...)"               ;; (: ...)
  "(?i:...)"              ;; (w/nocase ...)
  "(?-i:...)"             ;; (w/case ...)
  "(?<name>...)"          ;; (=> <name> ...)

  ;; character classes
  "[aeiou]"               ;; ("aeiou")
  "[^aeiou]"              ;; (~ ("aeiou"))
  "[a-z]"                 ;; (/ "az") or (/ "a" "z")
  "[[:alpha:]]"           ;; alpha

  ;; assertions
  "(?=...)"               ;; (look-ahead ...)
  "(?!...)"               ;; (neg-look-ahead ...)
  "(?<=...)"              ;; (look-behind ...)
  "(?<!...)"              ;; (neg-look-behind ...)
  "(?(test)pass|fail)"    ;; (if test pass fail)
  "(*COMMIT)"             ;; commit
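
As a quick check of the correspondence, the sketch below runs a
PCRE-style string and a hand-written SRE equivalent over the same
input.  The pattern and the input {{"x=42"}} are illustrative only.

<enscript highlight=scheme>
(import (chicken irregex))

(define text "x=42")

;; PCRE-style string
(define m1 (irregex-search "([a-z]+)=([0-9]+)" text))

;; Equivalent SRE
(define m2
  (irregex-search '(: ($ (+ (/ "az"))) "=" ($ (+ (/ "09")))) text))

(irregex-match-substring m1 1)  ; => "x"
(irregex-match-substring m2 1)  ; => "x"
(irregex-match-substring m1 2)  ; => "42"
(irregex-match-substring m2 2)  ; => "42"
</enscript>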

=== Chunked String Matching

It's often desirable to perform regular expression matching over
sequences of characters not represented as a single string.  The most
obvious example is a text-buffer data structure, but you may also want
to match over lists or trees of strings (i.e. ropes), over only
certain ranges within a string, over an input port, etc.  With
existing regular expression libraries, the only way to accomplish this
is by converting the abstract sequence into a freshly allocated
string.  This can be expensive, or even impossible if the object is a
text-buffer opened onto a 500MB file.

IrRegex provides a chunked string API specifically for this purpose.
You define a chunking API with {{make-irregex-chunker}}:

==== make-irregex-chunker

<procedure>(make-irregex-chunker <get-next> <get-string> [<get-start> <get-end> <get-substring> <get-subchunk>])</procedure>

where

{{(<get-next> chunk) => }} returns the next chunk, or {{#f}} if there are no more chunks

{{(<get-string> chunk) => }} a string source for the chunk

{{(<get-start> chunk) => }} the start index of the result of {{<get-string>}} (defaults to always 0)

{{(<get-end> chunk) => }} the end (exclusive) of the string (defaults to {{string-length}} of the source string)

{{(<get-substring> cnk1 i cnk2 j) => }} a substring for the range between the chunk {{cnk1}} starting at index {{i}} and ending at {{cnk2}} at index {{j}}

{{(<get-subchunk> cnk1 i cnk2 j) => }} as above but returns a new chunked data type instead of a string (optional)

There are two important constraints on the {{<get-next>}} procedure.
It must return an {{eq?}} identical object when called multiple times
on the same chunk, and it must not return a chunk with an empty string
(start == end).  This second constraint is for performance reasons -
we push the work of possibly filtering empty chunks to the chunker
since there are many chunk types for which empty strings aren't
possible, and this work is thus not needed.  Note that the initial
chunk passed to match on is allowed to be empty.

{{<get-substring>}} is provided for possible performance improvements
- without it a default is used.  {{<get-subchunk>}} is optional -
without it you may not use {{irregex-match-subchunk}} described above.

You can then match chunks of these types with the following
procedures:

==== irregex-search/chunked
==== irregex-match/chunked

<procedure>(irregex-search/chunked <irx> <chunker> <chunk> [<start>])</procedure><br>
<procedure>(irregex-match/chunked <irx> <chunker> <chunk> [<start>])</procedure>

These return normal match-data objects.

Example:

To match against a simple, flat list of strings use:

<enscript highlight=scheme>
  (define (rope->string rope1 start rope2 end)
    (if (eq? rope1 rope2)
        (substring (car rope1) start end)
        (let loop ((rope (cdr rope1))
                   (res (list (substring (car rope1) start))))
           (if (eq? rope rope2)
               (string-concatenate-reverse      ; from SRFI-13
                (cons (substring (car rope) 0 end) res))
               (loop (cdr rope) (cons (car rope) res))))))

  (define rope-chunker
    (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x)))
                          car
                          (lambda (x) 0)
                          (lambda (x) (string-length (car x)))
                          rope->string))

  (irregex-search/chunked <pat> rope-chunker <list-of-strings>)
</enscript>

Here we are just using the default start, end and substring behaviors,
so the above chunker could simply be defined as:

<enscript highlight=scheme>
  (define rope-chunker
    (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car))
</enscript>
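
As a concrete sketch of this chunker in use, the search below runs
over a list of strings in which the match spans two chunks.  The
pattern and the list contents are invented for illustration, and the
result shown assumes the default substring behavior described above.

<enscript highlight=scheme>
(import (chicken irregex))

;; Each chunk is a non-empty tail of a list of strings.
(define rope-chunker
  (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car))

;; "needle" is split across the second and third strings.
(define m
  (irregex-search/chunked (irregex "needle")
                          rope-chunker
                          '("hay" "nee" "dle" "hay")))

(irregex-match-substring m)  ; => "needle"
</enscript>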

==== irregex-fold/chunked

<procedure>(irregex-fold/chunked <irx> <kons> <knil> <chunker> <chunk> [<finish> [<start-index>]])</procedure>

Chunked version of {{irregex-fold}}.

=== Utilities

The following procedures are also available.

==== irregex-quote

<procedure>(irregex-quote <str>)</procedure>

Returns a new string with any special regular expression characters
escaped, to match the original string literally in POSIX regular
expressions.
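
A short sketch of why quoting matters; the string {{"1+1=2?"}} is an
arbitrary example containing two operator characters:

<enscript highlight=scheme>
(import (chicken irregex))

(define literal (irregex-quote "1+1=2?"))

(irregex-match? literal "1+1=2?")   ; => #t
(irregex-match? literal "11=2")     ; => #f

;; Unquoted, "+" and "?" act as operators.
(irregex-match? "1+1=2?" "11=2")    ; => #t
</enscript>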

==== irregex-opt

<procedure>(irregex-opt <list-of-strings>)</procedure>

Returns an optimized SRE matching any of the literal strings
in the list, like Emacs' {{regexp-opt}}.  Note this optimization
doesn't help when irregex is able to build a DFA.
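
For example, a keyword matcher could be sketched like this; the word
list and the name {{keyword}} are illustrative:

<enscript highlight=scheme>
(import (chicken irregex))

(define keyword (irregex (irregex-opt '("define" "lambda" "let"))))

(irregex-match? keyword "lambda")  ; => #t
(irregex-match? keyword "let")     ; => #t
(irregex-match? keyword "letrec")  ; => #f
</enscript>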

==== sre->string

<procedure>(sre->string <sre>)</procedure>

Convert an SRE to a PCRE-style regular expression string, if
possible.


---
Previous: [[Module (chicken io)]]

Next: [[Module (chicken keyword)]]