source: project/wiki/eggref/4/lexgen @ 30850

Last change on this file since 30850 was 30850, checked in by Ivan Raikov, 7 years ago

lexgen doc: added a note on the behavior of bind on an empty input stream [thanks to Peter Bex].

File size: 12.0 KB
Line 
1[[tags: eggs]]
2[[toc:]]
3
4== lexgen
5
6=== Description
7
8{{lexgen}} is a lexer generator comprised in its core of only five
9small procedures that can be combined to form pattern matchers.
10
11A pattern matcher procedure takes an input stream, and returns a new
12stream advanced by the pattern.
13
14A stream is defined as a list that contains a list of characters
15consumed by the pattern matcher, and a list of characters not yet
16consumed. E.g., the list
17
18  ((#\a) (#\b #\c #\d #\e))
19
20represents a stream that contains the consumed character a, and the
21unconsumed characters b c d e.
22
23A pattern matcher has the form of a procedure that takes a success
24continuation, which is invoked when the pattern matches and the stream
25is advanced, an error continuation, which is invoked when the pattern
26does not match, and an input stream.
27
28=== Library Procedures
29
30Every combinator procedure in this library returns a procedure that
31takes in a success continuation, error continuation and input stream
32as arguments.
33
34==== Basic procedures
35
36<procedure>(seq MATCHER1 MATCHER2) => MATCHER</procedure>
37
38{{seq}} builds a matcher that matches a sequence of patterns.
39
40<procedure>(bar MATCHER1 MATCHER2) => MATCHER</procedure>
41
42{{bar}} matches either of two patterns. It's analogous to patterns
43separated by {{|}} in traditional regular expressions.
44
45<procedure>(star MATCHER) => MATCHER</procedure>
46
47{{star}} is an implementation of the Kleene closure. It is analogous
48to {{*}} in traditional regular expressions.
49
50==== Token procedure
51
52<procedure>(tok <Input>) => (LAMBDA TOKEN PROC) => MATCHER</procedure>
53
54Procedure {{tok}} builds pattern matchers based on character
55comparison operations. It is intended for matching input sequences of
56arbitrary kinds, e.g. character lists, strings, or other kinds of
57sequences. To achieve abstraction over the input sequence kind,
58{{tok}} is parameterised on a type class named {{<Input>}}. Please see
59libraries [[typeclass]] and [[input-classes]] for information on the
60type class interface.
61
62As an example, the code below creates an input class for character
63lists and defines a version of {{tok}} specialized for character
64lists.
65
66<enscript highlight="scheme">
67(require-extension typeclass input-classes)
68
69(define char-list-<Input>
70  (make-<Input> null? car cdr))
71
72(define char-list-tok (tok <char-list-<Input>))
73</enscript>
74
75Once applied to an input class, {{tok}} builds a pattern matcher that,
76for each stream given, applies a procedure to the given token
77{{TOKEN}} and an input character. If the procedure returns a true
78value, that value is prepended to the list of consumed elements, and
79the input character is removed from the list of input elements.
80
81==== {{<CharLex>}} type class and related procedures
82
83This library provides several procedures for character matching based
84on the {{tok}} procedure. These procedures are enumerated as the
85fields of another typeclas, {{<CharLex>}}, which inherits from the
86{{<Token>}} typeclass:
87
88<enscript highlight="scheme">
89 (define-class <CharLex> (<Token> T)  char set range lit)
90</enscript>
91
92The {{<Token>}} typeclass inherits from the {{<Input>}} typeclass and
93contains only the {{tok}} field:
94
95<enscript highlight="scheme">
96 (define-class <Token> (<Input> input)  tok)
97</enscript>
98
99This library provides convenience functions to create instances of
100{{CharLex}} based on different input typeclasses:
101
102<procedure>(Input->Token INPUT-CLASS => TOKEN-CLASS)</procedure>
103
104This procedure takes an instance of the {{<Input>}} typeclass, created
105by the {{make-<Instance>}} constructor shown above, and returns an
106instance of the {{<Token>}} typeclass, which in turn contains an
107instance of {{tok}} specialized for the given input class.
108
109<procedure>(Token->CharLex TOKEN-CLASS => CHARLEX-CLASS)</procedure>
110
111This procedure takes an instance of the {{<Token>}} typeclass, and
112returns an instance of the {{CharLex}} typeclass, which contains the
113following procedures:
114
115<procedure>(char CHAR) => MATCHER</procedure>
116
117Matches a single character.
118
119<procedure>(set CHAR-SET) => MATCHER</procedure>
120
121Matches any of a SRFI-14 set of characters.
122
123<procedure>(range CHAR CHAR) => MATCHER</procedure>
124
125Matches a range of characters. Analogous to character class {{[]}}.
126
127<procedure>(lit STRING) => MATCHER</procedure>
128
129Matches a literal string {{s}}.
130
131==== Convenience procedures
132
133These procedures are built from the basic procedures and are provided
134for convenience.
135
136<procedure>(try PROC) => PROC</procedure>
137
138Converts a binary predicate procedure to a binary procedure that
139returns its right argument when the predicate is true, and false
140otherwise.
141
142<procedure>(lst MATCHER-LIST) => MATCHER</procedure>
143
144Constructs a matcher for the sequence of matchers in {{MATCHER-LIST}}.
145
146<procedure>(pass) => MATCHER</procedure>
147
148This matcher returns without consuming any input.
149
150<procedure>(pos MATCHER) => MATCHER</procedure>
151
152Positive closure. Analogous to {{+}}.
153
154<procedure>(opt MATCHER) => MATCHER</procedure>
155
156Optional pattern. Analogous to {{?}}.
157
158<procedure>(bind F P) => MATCHER</procedure>
159
160Given a rule {{P}} and function {{F}}, returns a matcher that first
161applies {{P}} to the input stream, then applies {{F}} to the returned
162list of consumed tokens, and returns the result and the remainder of
163the input stream.
164
165Note: this combinator will call its failure continuation if the input
166stream is empty.
167
168<procedure>(rebind F G P) => MATCHER</procedure>
169
170Given a rule {{P}} and procedures {{F}} and {{G}}, returns a matcher
171that first applies {{F}} to the input stream, then applies {{P}} to
172the resulting stream, then applies {{G}} to the resulting list of
173consumed elements and returns the result along with the remainder of
174the input stream.
175
176<procedure>(drop P) => MATCHER</procedure>
177
178Given a rule {{P}}, returns a matcher that always returns an empty
179list of consumed tokens when {{P}} succeeds.
180
181==== Lexer procedure
182
183<procedure>(lex MATCHER ERROR STRING) => CHAR-LIST</procedure>
184
185{{lex}} takes a pattern and a string, turns the string into a list of
186streams (containing one stream), applies the pattern, and returns the
187first possible match. Argument {{ERROR}} is a single-argument
188procedure called when the pattern does not match anything.
189
190=== Examples
191
192==== Creating a lexer specialized for lists of characters
193
194<enscript highlight="scheme">
195(require-extension typeclass input-classes lexgen srfi-1 srfi-14 test)
196
197;; The following definitions create matchers {{char}} {{range}}
198;; {{set}} {{lit}} specialized for lists of characters.
199
200(define char-list-<Input>
201  (make-<Input> null? car cdr))
202
203;; Creates an instance of the <Token> typeclass that is named char-list-<Token>
204
205(define char-list-<Token>
206  (Input->Token char-list-<Input>))
207
208;; Creates an instance of the <CharLex> typeclass that is named char-list-<CharLex>
209
210(define char-list-<CharLex>
211  (Token->CharLex char-list-<Token>))
212
213;; The following declaration imports the fields of the typeclass
214;; instances defined above, and prefixes each identifier with
215;; char-list/.  For example, if the <Token> typeclass defines a field
216;; named range, the import-instance declaration below will create an
217;; identifier named char-list/range.
218
219(import-instance (<Token> char-list-<Token> char-list/)
220                 (<CharLex> char-list-<CharLex> char-list/))
221</enscript>
222
223==== A pattern to match floating point numbers
224
225<enscript highlight="scheme">
226
227;;  A pattern to match floating point numbers.
228;;  "-"?(([0-9]+(\\.[0-9]+)?)|(\\.[0-9]+))([eE][+-]?[0-9]+)?
229
230(define numpat
231  (let* ((digit        (char-list/range #\0 #\9))
232         (digits       (pos digit))
233         (fraction     (seq (char-list/char #\.) digits))
234         (significand  (bar (seq digits (opt fraction)) fraction))
235         (exp          (seq (char-list/set "eE") (seq (opt (char-list/set "+-")) digits)))
236         (sign         (opt (char-list/char #\-))))
237    (seq sign (seq significand (opt exp)))))
238 
239 (define (err s)
240  (print "lexical error on stream: " s)
241  (list))
242
243 (lex numpat err "-123.45e-6")
244</enscript>
245
246==== Tokens with position information
247<enscript highlight="scheme">
248       
249(define-record-type postok
250  (make-postok pos token)
251  postok?
252  (pos        postok-pos )
253  (token      postok-token )
254  )
255
256(define pos? pair?)
257(define pos-row car)
258(define pos-col cdr)
259(define make-pos cons)
260
261(define-record-printer (postok x out)
262  (fprintf out "#<token ~A: ~A>"
263           (postok-pos x)
264           (postok-token x)))
265         
266(define (getpos p)
267  (let ((f (lambda (in) (and (pair? in) (postok-pos (car in)))))
268        (g (lambda (i s) (list (make-postok i (car s))))))
269    (rebind f g p)))
270
271(define pos-<Input>
272  (let ((pos-tail
273         (lambda (strm)
274           (cond ((or (null? strm) (null? (cdr strm)))  '())
275                 (else
276                  (let* ((curtok  (car strm))
277                         (pos0    (postok-pos curtok))
278                         (pos1    (let ((row0 (pos-row pos0))
279                                        (col0 (pos-col pos0)))
280                                    (case (cadr strm)
281                                      ((#\newline)  (make-pos (+ 1 row0) 1))
282                                      ((#\return)   (make-pos row0 1))
283                                      (else         (make-pos row0 (+ 1 col0))))))
284                         (res (cons (make-postok pos1 (cadr strm)) (cddr strm))))
285                    res)))))
286        (pos-null? null?)
287        (pos-head  (compose postok-token car)))
288    (make-<Input> pos-null? pos-head pos-tail)))
289
290(define pos-<Token>
291  (Input->Token pos-<Input>))
292
293(define pos-<CharLex>
294  (Token->CharLex pos-<Token>))
295
296(import-instance (<Token> pos-<Token> pos/)
297                 (<CharLex> pos-<CharLex> pos/))
298
299(define (make-pos-stream strm)
300  (let ((begpos (make-pos 1 1)))
301    `(() ,(cons (make-postok begpos (car strm)) (cdr strm)))))
302 
303(define pos-numpat-stream
304  (make-pos-stream (string->list "-123.45e-6")))
305
306(define pbnumpat
307  (let* ((digit        (pos/range #\0 #\9))
308         (digits       (star digit))
309         (fraction     (seq (pos/char #\.) digits))
310         (significand  (bar (seq digits (opt fraction)) fraction))
311         (exp          (seq (pos/set "eE") (seq (opt (pos/set "+-")) digits)))
312         (sign         (opt (pos/char #\-)) )
313         (pat          (seq (getpos (bind make-sign sign))
314                            (seq (getpos (bind make-significand significand))
315                                 (getpos (bind make-exp (opt exp)))))))
316    pat))
317
318(define (pos-num-parser s)  (car (lex pbnumpat err s)))
319</enscript>
320
321=== Requires
322
323* [[utf8]]
324* [[typeclass]]
325* [[input-classes]]
326
327=== Version History
328
329* 6.1-6.2 Corrected behavior of the tok combinator so that the failure continuation is invoked upon end-of-input [thanks to Chris Salch]
330* 6.0 Using utf8 for char operations
331* 5.2 Ensure test script returns proper exit status
332* 5.0-5.1 Added error continuation to the matcher interface and eliminated multiple stream matching
333* 4.0 Implemented typeclass interface for abstracting over input sequences
334* 3.8 Added procedure {{star*}} (greedy Kleene closure matching)
335* 3.6 Added procedure redo [thanks to Christian Kellermann]
336* 3.5 Bug fixes in bind [reported by Peter Bex]
337* 3.3 Bug fixes in stream comparison
338* 3.2 Improved input stream comparison procedures
339* 3.1 Added rebind combinator and stream-unfold procedure
340* 3.0 Added an extension mechanism for input streams of different
341  types (to be elaborated and documented in subsequent versions).
342* 2.6 Added bind and drop combinators
343* 2.5 The seq combinator checks whether the first parser in the sequence has failed
344* 2.4 Added (require-library srfi-1); using lset<= instead of equal? in star
345* 2.3 Bug fix in procedure range; added procedure cps-table
346* 2.2 Bug fix in procedure star
347* 2.1 Added procedure lst
348* 2.0 Core procedures rewritten in continuation-passing style
349* 1.5 Using (require-extension srfi-1)
350* 1.4 Ported to Chicken 4
351* 1.2 Added procedures try and tok (supersedes pred)
352* 1.0 Initial release
353
354=== License
355
356Based on the [[http://www.standarddeviance.com/projects/combinators/combinators.html|SML lexer generator by Thant Tessman]].
357
358  Copyright 2009-2012 Ivan Raikov and the Okinawa Institute of Science and
359  Technology.
360 
361 
362  This program is free software: you can redistribute it and/or modify
363  it under the terms of the GNU General Public License as published by
364  the Free Software Foundation, either version 3 of the License, or
365  (at your option) any later version.
366 
367  This program is distributed in the hope that it will be useful, but
368  WITHOUT ANY WARRANTY; without even the implied warranty of
369  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
370  General Public License for more details.
371 
372  A full copy of the GPL license can be found at
373  <http://www.gnu.org/licenses/>.
374 
Note: See TracBrowser for help on using the repository browser.