source: project/release/5/uri-generic/trunk/alternatives/README

Last change on this file was 33655, checked in by sjamaan, 2 years ago

Update prcc notes with current timings

Apparently the maybe/list hack sped things up a bit? Or perhaps it
was some other tweak.

File size: 8.2 KB
Line 
1Here are some alternative implementations of uri-generic, using
2various parsing libraries.  They all passed the full testsuite at the
3time they were written.  Occasionally they are updated to make them
4work again with the latest testsuite.
5
6The original uri-generic implementation (based on matchable) is a bit
7hard to read, but it is by far the most efficient. For readability I
8prefer the irregex one, based on composable Scheme Regular
9Expressions.  It is about average in performance compared to the other
10implementations and it will hopefully improve when irregex is optimized.
11
12Here are some notes I made after benchmarking and implementing each version:
13(benchmarks on a Lenovo X230, Chicken 4.11.0, which ships w/ irregex 0.9.4)
14
15The benchmark can be done like this:
16
17(load "uri-generic.irregex.so") ;; Use the compiled version!
18(load "../uri-generic.import.scm")
19(import uri-generic)
20(time (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))))
21
22You can compile with:
23
24for i in uri-generic.*.scm; do echo Compiling $i; csc -s -O3 $i; done
25
26Test with csi -s test.scm after changing it for the tested implementation.
27
28The +/- behind the LOC indicates that some parts are implemented using
29irregexes.  If those parts would be converted to get rid of the irregex
30dependency, the lines of code would change (generally increase)
31
32==============================================================================
33
34Current main implementation: based on matchable [[1284 LOC]]
35
360.276s CPU time, 0.036s GC time (major), 1320148/7492 mutations (total/tracked), 10/1653 GCs (major/minor)
37
38NOTES:
39
40Implementation is a bit hard to read & understand even if you know how
41it works. It is also the largest implementation (in lines of code).
42But it's faaaaaast! Turns out that converting a string to a list of chars
43once and then using an optimizing macro to match the list against expected
44chars is a very efficient way to do parsing :)
45
46------------------------------------------------------------------------------
47
48Implementation based on abnf [[966 LOC]]
49
501.392s CPU time, 0.112s GC time (major), 2980315/5741 mutations (total/tracked), 27/11011 GCs (major/minor)
51
52NOTES:
53
54abnf has undergone a few major API changes, which meant this
55alternative implementation required a massive overhaul.  I'd be
56hesitant to rely on a parsing egg which changes so radically.  On the
57other hand, it may have stabilised by now.
58
59The API is pretty easy to use, and the resulting parsers rather
60readable.
61
62Originally abnf appeared to be faster than comparse, but we were
63comparing apples and oranges: the abnf implementation of
64uri-decode-string was still using the irregex-based version, which
65turns out to be quite the bottleneck.
66
67------------------------------------------------------------------------------
68
69Implementation based on comparse [[893 LOC]]
70
710.84s CPU time, 0.116s GC time (major), 1870231/55987 mutations (total/tracked), 43/6724 GCs (major/minor)
72
73NOTES:
74
75Reasonably easy to use and the resulting parser is pretty readable.
76It used to be very slow, but performance has been worked on a lot, so
77now it's more acceptable.
78
79In many respects it is very similar to abnf; obviously I prefer the
80license of comparse, but overall abnf is more readable.  The learning
81curve of the current version of abnf is slightly steeper than comparse
82(the old one being quite hard to use), but once you "get" it it's easy
83enough.  This may just be a documentation issue and the confusion of
84having to deal with the API of 2 eggs (lexgen/typeclass) rather than
85an inherent problem.
86
87------------------------------------------------------------------------------
88
89irregex-based implementation [[859 LOC]]
90
910.896s CPU time, 0.024s GC time (major), 600077/2761 mutations (total/tracked), 24/3076 GCs (major/minor)
92
93NOTES:
94
95Very concise but still very readable syntax when using named matches.
96I especially like that built-in regexes like hex-digit and such are all
97quoted, whereas custom bits are unquoted.  It's an immediate indication
98that the definition is to be found elsewhere in the code (AFAIK, this
99is not extendable so you can't make libraries of regexes in the
100same way; at that point everything of the library becomes unquoted too).
101
102The 0.8.1 version of irregex which shipped with Chickens up till 4.6.1
103was extremely slow (this benchmark took over 50 seconds to run), but
104with irregex 0.8.2, the irregex implementation performs similarly to
105the matchable implementation (but it's still slightly slower).
106
107------------------------------------------------------------------------------
108
109packrat-based implementation [[997 LOC]]
110
1111.972s CPU time, 0.296s GC time (major), 3110328/513298 mutations (total/tracked), 273/10605 GCs (major/minor)
112
113NOTES:
114
115In compiled mode it's the slowest of the bunch, and it's extremely
116clumsy.  Do not use, avoid at all costs, etc.  Why?  Many reasons:
117
118It is extremely annoying that the macro requires you to provide a body
119expression.  There is no automatic way to let it just resolve to
120whatever the lower nonterminals resolve to, nor to combine code from
121a sequence of expressions so you have to spell it out every time.
122
123In general, the macro just feels clumsy and you have to keep the
124specification next to your code all the time because it feels so
125unnatural (requiring, for my feeling, superfluous sets of parentheses).
126
127The macro does not allow escaping to Scheme quickly to produce a
128dynamically generated parser in-line.  You can define procedures
129outside of the macro which can be called as "nonterminals" but you
130can't, for example, parameterize a parser on-the-fly.
131
132The macro is much more like plain old BNF or normal yacc.  This means
133that for simple things like optional arguments you need to make a
134separate rule containing both the argument and an empty sequence.
135Quite annoying, after you've worked with any of the above parser libs
136where that's no problem.  Perhaps I just misunderstood how to use
137the (/ ...) rule, but I couldn't get it to work.
138
139Debugging is pretty terrible, due to the definitions being wrapped in
140one large macro.  Because everything is wrapped up in one macro, you
141can't easily reuse subparsers, unless you rip them out and put them in
142their own macro.  This makes it completely uncomposable.
143
144Possibly things are better when you just use the combinators instead,
145not using the macro. The advantage of the macro is that the resulting
146ruleset is quite readable.
147
148------------------------------------------------------------------------------
149
150prcc-based implementation [[923 LOC]]
151
15233.26s CPU time, 1.72s GC time (major), 63413798/28125899 mutations (total/tracked), 919/263952 GCs (major/minor)
153
154NOTES:
155
156At first this looks like a very limited library, but once you start
157working with it you realise that all the important combinators are
158there.  It's a very clean API if you ignore all the <xyz> aliases.  It
159only supplies what's needed, and you can easily build additional
160combinators on it.
161
162The documentation was a bit lacking, but a dive into the source code
163was enough to figure out most things (and now I've extended the docs a
164bit with that knowledge).
165
166I used the comparse library as a basis, which was easy enough to
167convert to prcc.  Found a couple of bugs, which were fixed within
168hours by the very responsive maintainer, even though the egg was from
1692012 and had only one release!
170
171A real disadvantage is that parse errors will always be written to the
172current output port.  There seems to be no way to supply custom error
173handling.  The error handling itself is pretty decent in that it keeps
174track of input position and it knows what was expected when the error
175occurred.  Unfortunately, for more complex parsers this doesn't work
176perfectly: (parse-string "qux" (sel (str "foo") (str "bar"))) results
177in an error that says "bar was expected".  On the other hand, this
178would be easy to customize by using "act" with a custom failure
179handler.
180
181It's obviously too slow for real-world use, probably because it uses
182srfi-41 which is not known to be efficient at all.  It could probably
183be converted to lazy-seq, which should speed things up, but I haven't
184done any serious performance comparisons.
185
186=========================================================================
187
188Other notes that people interested in Scheme parsing might want to read
189(by zbigniew): http://3e8.org/zb/eggs/lexing.txt
Note: See TracBrowser for help on using the repository browser.