source: project/release/4/uri-generic/trunk/alternatives/README @ 30969

Last change on this file since 30969 was 30969, checked in by sjamaan, 6 years ago

Comparse vs abnf: abnf is a little nicer to use and more readable as well as faster. abnf is harder to grok initially

File size: 6.3 KB
Line 
1Here are some alternative implementations of uri-generic, using
2various parsing libraries.  They all passed the full testsuite at the
3time they were written.  Occasionally they are updated to make them
4work again with the latest testsuite.
5
6The original uri-generic implementation (based on matchable) is a bit
7hard to read, but it is by far the most efficient. For readability I
8prefer the irregex one, based on composable Scheme Regular
9Expressions.  It is about average in performance compared to the other
10implementations and it will hopefully improve when irregex is optimized.
11
12Here are some notes I made after benchmarking and implementing each version:
13(benchmarks on a Lenovo X230, Chicken 4.9.1, which ships w/ irregex 0.9.2)
14
15The benchmark can be done like this:
16
17(load "uri-generic.irregex.so") ;; Use the compiled version!
18(load "../uri-generic.import.scm")
19(import uri-generic)
20(time (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))))
21
22You can compile with:
23
24for i in uri-generic.*.scm; do echo Compiling $i; csc -s -O3 $i; done
25
26Test with csi -s test.scm after changing it for the tested implementation.
27
28The +/- behind the LOC indicates that some parts are implemented using
29irregexes.  If those parts would be converted to get rid of the irregex
30dependency, the lines of code would change (generally increase)
31
32==============================================================================
33
34Current main implementation: based on matchable [[1260 LOC]]
35
360.252s CPU time, 0.028s GC time (major), 1320149/4900 mutations (total/tracked), 15/2139 GCs (major/minor)
37
38NOTES:
39
40Implementation is a bit hard to read & understand even if you know how
41it works. It is also the largest implementation (in lines of code).
42But it's faaaaaast! Turns out that converting a string to a list of chars
43once and then using an optimizing macro to match the list against expected
44chars is a very efficient way to do parsing :)
45
46------------------------------------------------------------------------------
47
48Implementation based on abnf [[943 LOC]]
49
500.552s CPU time, 0.032s GC time (major), 1730190/3654 mutations (total/tracked), 13/4274 GCs (major/minor)
51
52NOTES:
53
54abnf has undergone a few major API changes, which meant this
55alternative implementation required a massive overhaul.  I'd be
56hesitant to rely on a parsing egg which changes so radically.  On the
57other hand, it may have stabilised by now.
58
59The API is pretty easy to use, and the resulting parsers rather
60readable.
61
62------------------------------------------------------------------------------
63
64Implementation based on comparse [[892 LOC]]
65
661.28s CPU time, 0.14s GC time (major), 480090/49717 mutations (total/tracked), 62/9842 GCs (major/minor)
67
68NOTES:
69
70Reasonably easy to use and the resulting parser is pretty readable,
71but not very fast.  I presume it's because it keeps a lot of state
72around for backtracking.
73
74In many respects it is very similar to abnf; obviously I prefer the
75license of comparse, but overall abnf is more readable and faster.
76The learning curve of the current version of abnf is slightly steeper
77than comparse (the old one being quite hard to use), but once you
78"get" it it's easy enough.  This may just be a documentation issue and
79the confusion of having to deal with the API of 2 eggs
80(lexgen/typeclass) rather than an inherent problem.
81
82------------------------------------------------------------------------------
83
84irregex-based implementation [[858 LOC]]
85
860.344s CPU time, 0.032s GC time (major), 600077/2532 mutations (total/tracked), 19/2711 GCs (major/minor)
87
88NOTES:
89
90Very concise but still very readable syntax when using named matches.
91I especially like that built-in regexes like hex-digit and such are all
92quoted, whereas custom bits are unquoted.  It's an immediate indication
93that the definition is to be found elsewhere in the code (AFAIK, this
94is not extendable so you can't make libraries of regexes in the
95same way; at that point everything of the library becomes unquoted too).
96
97The 0.8.1 version of irregex which shipped with Chickens up till 4.6.1
98was extremely slow (this benchmark took over 50 seconds to run), but
99with irregex 0.8.2, the irregex implementation performs similarly to
100the matchable implementation (but it's still slightly slower).
101
102------------------------------------------------------------------------------
103
104packrat-based implementation [[997 LOC]]
105
1062.44s CPU time, 0.964s GC time (major), 3110328/498768 mutations (total/tracked), 556/9723 GCs (major/minor)
107
108NOTES:
109
110In compiled mode it's the slowest of the bunch, and it's extremely
111clumsy.  Do not use, avoid at all costs, etc.  Why?  Many reasons:
112
113It is extremely annoying that the macro requires you to provide a body
114expression.  There is no automatic way to let it just resolve to
115whatever the lower nonterminals resolve to, nor to combine code from
116a sequence of expressions so you have to spell it out every time.
117
118In general, the macro just feels clumsy and you have to keep the
119specification next to your code all the time because it feels so
120unnatural (requiring, for my feeling, superfluous sets of parentheses).
121
122The macro does not allow escaping to Scheme quickly to produce a
123dynamically generated parser in-line.  You can define procedures
124outside of the macro which can be called as "nonterminals" but you
125can't, for example, parameterize a parser on-the-fly.
126
127The macro is much more like plain old BNF or normal yacc.  This means
128that for simple things like optional arguments you need to make a
129separate rule containing both the argument and an empty sequence.
130Quite annoying, after you've worked with any of the above parser libs
131where that's no problem.  Perhaps I just misunderstood how to use
132the (/ ...) rule, but I couldn't get it to work.
133
134Debugging is pretty terrible, due to the definitions being wrapped in
135one large macro.  Because everything is wrapped up in one macro, you
136can't easily reuse subparsers, unless you rip them out and put them in
137their own macro.  This makes it completely uncomposable.
138
139Possibly things are better when you just use the combinators instead,
140not using the macro. The advantage of the macro is that the resulting
141ruleset is quite readable.
142
143=========================================================================
144
145Other notes that people interested in Scheme parsing might want to read
146(by zbigniew): http://3e8.org/zb/eggs/lexing.txt
Note: See TracBrowser for help on using the repository browser.