source: project/wiki/eggref/4/lexgen @ 30864

Last change on this file since 30864 was 30864, checked in by Ivan Raikov, 7 years ago

updated lexgen version history

File size: 12.4 KB
Line 
1[[tags: eggs]]
2[[toc:]]
3
4== lexgen
5
6=== Description
7
8{{lexgen}} is a lexer generator comprised in its core of only five
9small procedures that can be combined to form pattern matchers.
10
11A pattern matcher procedure takes an input stream, and returns a new
12stream advanced by the pattern.
13
14A stream is defined as a list that contains a list of characters
15consumed by the pattern matcher, and a list of characters not yet
16consumed. E.g., the list
17
18  ((#\a) (#\b #\c #\d #\e))
19
20represents a stream that contains the consumed character a, and the
21unconsumed characters b c d e.
22
23A pattern matcher has the form of a procedure that takes a success
24continuation, which is invoked when the pattern matches and the stream
25is advanced, an error continuation, which is invoked when the pattern
26does not match, and an input stream.
27
28=== Library Procedures
29
30Every combinator procedure in this library returns a procedure that
31takes in a success continuation, error continuation and input stream
32as arguments.
33
34==== Basic procedures
35
36<procedure>(seq MATCHER1 MATCHER2) => MATCHER</procedure>
37
38{{seq}} builds a matcher that matches a sequence of patterns.
39
40<procedure>(bar MATCHER1 MATCHER2) => MATCHER</procedure>
41
42{{bar}} matches either of two patterns. It's analogous to patterns
43separated by {{|}} in traditional regular expressions.
44
45<procedure>(star MATCHER) => MATCHER</procedure>
46
47{{star}} is an implementation of the Kleene closure. It is analogous
48to {{*}} in traditional regular expressions.
49
50==== Token procedure
51
52<procedure>(tok <Input>) => (LAMBDA TOKEN PROC) => MATCHER</procedure>
53
54Procedure {{tok}} builds pattern matchers based on character
55comparison operations. It is intended for matching input sequences of
56arbitrary kinds, e.g. character lists, strings, or other kinds of
57sequences. To achieve abstraction over the input sequence kind,
58{{tok}} is parameterised on a type class named {{<Input>}}. Please see
59libraries [[typeclass]] and [[input-classes]] for information on the
60type class interface.
61
62As an example, the code below creates an input class for character
63lists and defines a version of {{tok}} specialized for character
64lists.
65
66<enscript highlight="scheme">
67(require-extension typeclass input-classes)
68
69(define char-list-<Input>
70  (make-<Input> null? car cdr))
71
72(define char-list-tok (tok <char-list-<Input>))
73</enscript>
74
75Once applied to an input class, {{tok}} builds a pattern matcher that,
76for each stream given, applies a procedure to the given token
77{{TOKEN}} and an input character. If the procedure returns a true
78value, that value is prepended to the list of consumed elements, and
79the input character is removed from the list of input elements.
80
81==== {{<CharLex>}} type class and related procedures
82
83This library provides several procedures for character matching based
84on the {{tok}} procedure. These procedures are enumerated as the
85fields of another typeclas, {{<CharLex>}}, which inherits from the
86{{<Token>}} typeclass:
87
88<enscript highlight="scheme">
89 (define-class <CharLex> (<Token> T)  char set range lit)
90</enscript>
91
92The {{<Token>}} typeclass inherits from the {{<Input>}} typeclass and
93contains only the {{tok}} field:
94
95<enscript highlight="scheme">
96 (define-class <Token> (<Input> input)  tok)
97</enscript>
98
99This library provides convenience functions to create instances of
100{{CharLex}} based on different input typeclasses:
101
102<procedure>(Input->Token INPUT-CLASS => TOKEN-CLASS)</procedure>
103
104This procedure takes an instance of the {{<Input>}} typeclass, created
105by the {{make-<Instance>}} constructor shown above, and returns an
106instance of the {{<Token>}} typeclass, which in turn contains an
107instance of {{tok}} specialized for the given input class.
108
109<procedure>(Token->CharLex TOKEN-CLASS => CHARLEX-CLASS)</procedure>
110
111This procedure takes an instance of the {{<Token>}} typeclass, and
112returns an instance of the {{CharLex}} typeclass, which contains the
113following procedures:
114
115<procedure>(char CHAR) => MATCHER</procedure>
116
117Matches a single character.
118
119<procedure>(set CHAR-SET) => MATCHER</procedure>
120
121Matches any of a SRFI-14 set of characters.
122
123<procedure>(range CHAR CHAR) => MATCHER</procedure>
124
125Matches a range of characters. Analogous to character class {{[]}}.
126
127<procedure>(lit STRING) => MATCHER</procedure>
128
129Matches a literal string {{s}}.
130
131==== Convenience procedures
132
133These procedures are built from the basic procedures and are provided
134for convenience.
135
136<procedure>(try PROC) => PROC</procedure>
137
138Converts a binary predicate procedure to a binary procedure that
139returns its right argument when the predicate is true, and false
140otherwise.
141
142<procedure>(lst MATCHER-LIST) => MATCHER</procedure>
143
144Constructs a matcher for the sequence of matchers in {{MATCHER-LIST}}.
145
146<procedure>(pass) => MATCHER</procedure>
147
148This matcher returns without consuming any input.
149
150<procedure>(pos MATCHER) => MATCHER</procedure>
151
152Positive closure. Analogous to {{+}}.
153
154<procedure>(opt MATCHER) => MATCHER</procedure>
155
156Optional pattern. Analogous to {{?}}.
157
158<procedure>(bind F P) => MATCHER</procedure>
159
160Given a rule {{P}} and function {{F}}, returns a matcher that first
161applies {{P}} to the input stream, then applies {{F}} to the returned
162list of consumed tokens, and returns the result and the remainder of
163the input stream.
164
165Note: this combinator will signal failure if the input stream is
166empty.
167
168<procedure>(bind* F P) => MATCHER</procedure>
169
170The same as {{bind}}, but will signal success if the input stream is
171empty.
172
173<procedure>(rebind F G P) => MATCHER</procedure>
174
175Given a rule {{P}} and procedures {{F}} and {{G}}, returns a matcher
176that first applies {{F}} to the input stream, then applies {{P}} to
177the resulting stream, then applies {{G}} to the resulting list of
178consumed elements and returns the result along with the remainder of
179the input stream.
180
181Note: this combinator will signal failure if the input stream is
182empty.
183
184<procedure>(rebind* F G P) => MATCHER</procedure>
185
186The same as {{rebind}}, but will signal success if the input stream is
187empty.
188
189<procedure>(drop P) => MATCHER</procedure>
190
191Given a rule {{P}}, returns a matcher that always returns an empty
192list of consumed tokens when {{P}} succeeds.
193
194==== Lexer procedure
195
196<procedure>(lex MATCHER ERROR STRING) => CHAR-LIST</procedure>
197
198{{lex}} takes a pattern and a string, turns the string into a list of
199streams (containing one stream), applies the pattern, and returns the
200first possible match. Argument {{ERROR}} is a single-argument
201procedure called when the pattern does not match anything.
202
203=== Examples
204
205==== Creating a lexer specialized for lists of characters
206
207<enscript highlight="scheme">
208(require-extension typeclass input-classes lexgen srfi-1 srfi-14 test)
209
210;; The following definitions create matchers {{char}} {{range}}
211;; {{set}} {{lit}} specialized for lists of characters.
212
213(define char-list-<Input>
214  (make-<Input> null? car cdr))
215
216;; Creates an instance of the <Token> typeclass that is named char-list-<Token>
217
218(define char-list-<Token>
219  (Input->Token char-list-<Input>))
220
221;; Creates an instance of the <CharLex> typeclass that is named char-list-<CharLex>
222
223(define char-list-<CharLex>
224  (Token->CharLex char-list-<Token>))
225
226;; The following declaration imports the fields of the typeclass
227;; instances defined above, and prefixes each identifier with
228;; char-list/.  For example, if the <Token> typeclass defines a field
229;; named range, the import-instance declaration below will create an
230;; identifier named char-list/range.
231
232(import-instance (<Token> char-list-<Token> char-list/)
233                 (<CharLex> char-list-<CharLex> char-list/))
234</enscript>
235
236==== A pattern to match floating point numbers
237
238<enscript highlight="scheme">
239
240;;  A pattern to match floating point numbers.
241;;  "-"?(([0-9]+(\\.[0-9]+)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?
242
243(define numpat
244  (let* ((digit        (char-list/range #\0 #\9))
245         (digits       (pos digit))
246         (fraction     (seq (char-list/char #\.) digits))
247         (significand  (bar (seq digits (opt fraction)) fraction))
248         (exp          (seq (char-list/set "eE") (seq (opt (char-list/set "+-")) digits)))
249         (sign         (opt (char-list/char #\-))))
250    (seq sign (seq significand (opt exp)))))
251 
252 (define (err s)
253  (print "lexical error on stream: " s)
254  (list))
255
256 (lex numpat err "-123.45e-6")
257</enscript>
258
259==== Tokens with position information
260<enscript highlight="scheme">
261       
262(define-record-type postok
263  (make-postok pos token)
264  postok?
265  (pos        postok-pos )
266  (token      postok-token )
267  )
268
269(define pos? pair?)
270(define pos-row car)
271(define pos-col cdr)
272(define make-pos cons)
273
274(define-record-printer (postok x out)
275  (fprintf out "#<token ~A: ~A>"
276           (postok-pos x)
277           (postok-token x)))
278         
279(define (getpos p)
280  (let ((f (lambda (in) (and (pair? in) (postok-pos (car in)))))
281        (g (lambda (i s) (list (make-postok i (car s))))))
282    (rebind f g p)))
283
284(define pos-<Input>
285  (let ((pos-tail
286         (lambda (strm)
287           (cond ((or (null? strm) (null? (cdr strm)))  '())
288                 (else
289                  (let* ((curtok  (car strm))
290                         (pos0    (postok-pos curtok))
291                         (pos1    (let ((row0 (pos-row pos0))
292                                        (col0 (pos-col pos0)))
293                                    (case (cadr strm)
294                                      ((#\newline)  (make-pos (+ 1 row0) 1))
295                                      ((#\return)   (make-pos row0 1))
296                                      (else         (make-pos row0 (+ 1 col0))))))
297                         (res (cons (make-postok pos1 (cadr strm)) (cddr strm))))
298                    res)))))
299        (pos-null? null?)
300        (pos-head  (compose postok-token car)))
301    (make-<Input> pos-null? pos-head pos-tail)))
302
303(define pos-<Token>
304  (Input->Token pos-<Input>))
305
306(define pos-<CharLex>
307  (Token->CharLex pos-<Token>))
308
309(import-instance (<Token> pos-<Token> pos/)
310                 (<CharLex> pos-<CharLex> pos/))
311
312(define (make-pos-stream strm)
313  (let ((begpos (make-pos 1 1)))
314    `(() ,(cons (make-postok begpos (car strm)) (cdr strm)))))
315 
316(define pos-numpat-stream
317  (make-pos-stream (string->list "-123.45e-6")))
318
319(define pbnumpat
320  (let* ((digit        (pos/range #\0 #\9))
321         (digits       (star digit))
322         (fraction     (seq (pos/char #\.) digits))
323         (significand  (bar (seq digits (opt fraction)) fraction))
324         (exp          (seq (pos/set "eE") (seq (opt (pos/set "+-")) digits)))
325         (sign         (opt (pos/char #\-)) )
326         (pat          (seq (getpos (bind make-sign sign))
327                            (seq (getpos (bind make-significand significand))
328                                 (getpos (bind make-exp (opt exp)))))))
329    pat))
330
331(define (pos-num-parser s)  (car (lex pbnumpat err s)))
332</enscript>
333
334=== Requires
335
336* [[utf8]]
337* [[typeclass]]
338* [[input-classes]]
339
340=== Version History
341
342* 7.0 Added bind* and rebind* variants of bind and rebind [thanks to Peter Bex]
343* 6.1-6.2 Corrected behavior of the tok combinator so that the failure continuation is invoked upon end-of-input [thanks to Chris Salch]
344* 6.0 Using utf8 for char operations
345* 5.2 Ensure test script returns proper exit status
346* 5.0-5.1 Added error continuation to the matcher interface and eliminated multiple stream matching
347* 4.0 Implemented typeclass interface for abstracting over input sequences
348* 3.8 Added procedure {{star*}} (greedy Kleene closure matching)
349* 3.6 Added procedure redo [thanks to Christian Kellermann]
350* 3.5 Bug fixes in bind [reported by Peter Bex]
351* 3.3 Bug fixes in stream comparison
352* 3.2 Improved input stream comparison procedures
353* 3.1 Added rebind combinator and stream-unfold procedure
354* 3.0 Added an extension mechanism for input streams of different
355  types (to be elaborated and documented in subsequent versions).
356* 2.6 Added bind and drop combinators
357* 2.5 The seq combinator checks whether the first parser in the sequence has failed
358* 2.4 Added (require-library srfi-1); using lset<= instead of equal? in star
359* 2.3 Bug fix in procedure range; added procedure cps-table
360* 2.2 Bug fix in procedure star
361* 2.1 Added procedure lst
362* 2.0 Core procedures rewritten in continuation-passing style
363* 1.5 Using (require-extension srfi-1)
364* 1.4 Ported to Chicken 4
365* 1.2 Added procedures try and tok (supersedes pred)
366* 1.0 Initial release
367
368=== License
369
370Based on the [[http://www.standarddeviance.com/projects/combinators/combinators.html|SML lexer generator by Thant Tessman]].
371
372  Copyright 2009-2014 Ivan Raikov and the Okinawa Institute of Science and
373  Technology.
374 
375 
376  This program is free software: you can redistribute it and/or modify
377  it under the terms of the GNU General Public License as published by
378  the Free Software Foundation, either version 3 of the License, or
379  (at your option) any later version.
380 
381  This program is distributed in the hope that it will be useful, but
382  WITHOUT ANY WARRANTY; without even the implied warranty of
383  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
384  General Public License for more details.
385 
386  A full copy of the GPL license can be found at
387  <http://www.gnu.org/licenses/>.
388 
Note: See TracBrowser for help on using the repository browser.