source: project/wiki/c5-hash-strategy @ 36409

Last change on this file since 36409 was 35966, checked in by kristianlm, 14 months ago

c5 hash strategy: initial draft

File size: 7.9 KB
Line 
1== Suggested Hash Algorithm Conventions for CHICKEN 5
2[[toc:]]
3
4=== Why reinvent the wheel
5Maybe it's just me, but I get really confused every time I try to use
6hashing alrogithms in CHICKEN.
7
8For C5, there's the {{message-digest-primitive}} egg which hash
9algorithms eggs depend on. It doesn't have too many dependencies:
10
11<enscript highlight="scheme">    
12sha2
13 `-- message-digest-primitive
14       `-- check-errors
15</enscript>
16
17But when you want to get the sha256 value of a string, you need the
18{{message-digest-utils}} egg in addition to the {{sha2}} egg, which gives
19you the {{message-digest-byte-vector}} module which gives you the
20{{message-digest-string}} procedure you're after, as well as 11 other
21dependencies. This seems like a little too much.
22
23CHICKEN 4's simple-sha1 and friends were probably written based on the
24same reasoning.
25
26
27==== Solution 1
28Implement a message-digest-string egg which has fewer dependencies
29than message-digest-utils and exports a message-digest-string
30procedure (not too much work).
31
32
33==== Solution 2
34Introduce what's described in this document in new and/or existing
35hash implementations (painful).
36
37
38=== Proposal
39This is CHICKEN 5 hashing strategy proposal which aims to:
40
41* be easy to use
42* make it possible to write hash algorithm eggs without dependencies
43  (looking at you, {{message-digest-primitive}})
44* make it possible to use hash algorithm eggs without dependencies
45  (looking at you, {{message-digest-utils}})
46* put all hash algorithms under a common namespace prefix, eg.
47  {{(import (hash sha256))}}
48
49The hashing alrogithms for CHICKEN themselves typically just embed a C
50implementation and don't require any other eggs other than
51{{message-digest-primitive}}.
52
53If you want to be hash-algorithm agnostic, this should still be
54manageable as long as all {{hash-*}} eggs follow a certain standard.
55
56
57=== 3-function interface
58It's possible to offer fast hash algorithms for many (all?) hashing
59alrorithm given only three functions: {{init}}, {{update!}} and {{final!}}.
60
61By convention, we could force all {{(import (hash <alg>))}} modules/eggs
62for CHICKEN 5 to export these 4 identifiers:
63
64<constant> digest-length</constant>
65
66Number of bytes returned as blob/string in {{final!}}. Eg. 32 for
67sha256's.
68
69    [pocedure] (init [ args ... ] )
70
71Returns a fresh context that can be passed to {{update!}} and {{final!}},
72and only to those two procedures. It can be any scheme object, but it
73is generally a good idea to use something that checks for the right
74type as this will produce errors (instead of segfaults) if {{update!}}
75and {{final!}} are given wrong contexts. Contexts are therefore
76typically wrapped in their own record types.
77
78{{context}} cannot (usually) be inspected directly.
79
80{{init}} may contain any number of arguments depending on the hashing
81algorithm.
82
83blockquotehmac's {{init}} will probably need some data to initialize itself.
84
85<procedure> (update! context data #!optional len offset)</procedure>
86
87Mutates context by input {{data}}, which may be of different types. The
88type of {{data}} may be:
89
90* a string
91* a blob
92* a u8vector
93* a c-pointer
94
95{{len}} specifies the number of bytes to read from {{data}}, starting from
96{{offset}}. It defaults to {{#f}}, indicating to read all bytes in
97{{data}}. {{len}} can be any positive fixnum and is only limited by memory
98available.
99
100It is an error if {{len}} is not given (or {{#f}}) if the type of data is
101c-pointer.
102
103{{offset}} is always optional and specifies the number of bytes to skip
104and defaults to {{0}}. It is an error if {{(> (+ len offset)
105(number-of-bytes data))}}.
106
107{{update!}} returns {{context}} for convenience.
108
109<procedure> (final! context #!optional destination)</procedure>
110
111Fills {{destination}} with the hashing algorithms result reflected in
112{{context}}. {{destination}} defaults to a blob of length {{digest-length}},
113but may also be given as a string of the same length (that will be
114overwritten).
115
116{{final!}} returns {{destination}}.
117
118{{final!}} may modify {{context}} and should only be called once for each
119context.
120
121
122=== Simple example
123A {{sha256sum}} application could look like this:
124
125<enscript highlight="scheme">    
126(import (chicken port) (chicken io)
127        (hash sha256))
128
129(define c (init))
130(port-for-each (lambda (s) (update! c s))
131               (lambda () (read-string 1024)))
132(print (final! c))
133</enscript>
134
135If another hashing algorithm is desired, only the import needs to be
136changed. If multiple hashing alrorithms are used, the imports will
137need to be prefixed or lexically scoped:
138
139<enscript highlight="scheme">    
140(import (chicken io) (chicken process-context))
141
142(define sha256
143  (lambda ()
144    (import (hash sha256))
145    (values (init) update! final!)))
146
147(define sha512
148  (lambda ()
149    (import (hash sha512))
150    (values (init) update! final!)))
151
152(define alg* (car (command-line-arguments)))
153(define alg
154  (cond ((equal? "sha256" alg*) sha256)
155        ((equal? "sha512" alg*) sha512)
156        (else (error "unknown algorithm" alg*))))
157
158(receive (ctx update! final!) (alg)
159  (print (final! (update! ctx (read-string)))))
160</enscript>
161
162
163=== The return value of {{update!}}
164If we say {{update!}} must always return the given context, we can make
165this shortcut:
166
167<enscript highlight="scheme">    
168(import (hash sha256) (chicken io))
169(print (final! (update! (init) (read-string))))
170</enscript>
171
172This is nice because, as this is a common use-case, this is concise
173enough to (hopefully) avoid the need for util functions.
174
175
176=== Imports
177We could name them like this for example:
178
179* {{(import (hash blake2))}}
180* {{(import (hash crc32))}}
181* {{(import (hash hmac))}} <-- would this be hard?
182* {{(import (hash md5))}}
183* {{(import (hash sha1))}}
184* {{(import (hash sha256))}}
185* {{(import (hash sha512))}}
186
187All of these could be in separate eggs that ''only'' provide the
188{{digest-length}}, {{init}}, {{update!}} and {{final!}} functions. These eggs
189chould have none or few dependencies, and export one or more modules
190under {{(hash <algorithm-name>)}}.
191
192
193=== Hash util egg
194A separate {{(import (hash))}} egg could provide some utils that utilize
195the algorithms. It could provide:
196
197* blob->hex for easy printing
198* algorithm -> fast port hasher
199* algorithm -> fast file hasher
200
201Plus various other things that {{message-digest}} is good at.
202
203
204=== Sample code
205[[https://github.com/kristianlm/chicken-hash-strategy-proposal|This]]
206repository implements prototypes of this API conventions in
207sub-folders. Run {{chicken-install}} in each sub-directory to give these
208examples a spin.
209
210
211=== Discussion
212
213==== 3-function names
214Is {{init}}, {{update!}} and {{final!}} ok?
215
216
217==== Squash into single-function interface?
218Could the 3-function interface be compressed to a 1-function
219interface? The example to support both sha256 and sha512 feels a
220little clumpsy.
221
222
223===== 1-function interface proposal 1
224Each hash algorithm egg/module could export a single identifier like
225the {{sha512}} procedure above, that returns three values:
226
227* an newly created context
228* an {{update!}} procedure
229* a {{final!}} procedure
230
231This gives us a single point of entry, which is nice, but it means all
232users have to deal with multiple values.
233
234
235===== 1-function interface proposal 2
236Each hash algorithm egg/module could export a single identifier that
237returns a procedure that can be passed arguments of different argument
238types:
239
240<enscript highlight="scheme">    
241(define hasher (sha256))
242(hasher "abc") ;; string or blob for update!
243(print (hasher #f)) ;; #f for final!
244</enscript>
245
246* Good: You can pass the hashing "primitive" around as a single value,
247  just like a message-digest primitive.
248* Good: algorithm eggs export only 1 identifier, so naming conventions
249  are much less rigid
250* Bad: The dynamic dispatching of the returned procedure might make
251  type-checking and debugging harder.
252
253
254==== Chunk size
255Are we making it impossible to optimize this if we don't know some
256internal stuff like chunk-size? message-digest-primitive seems to be
257using it.
258
259
260==== Immutable version of {{final!}}
261Could it ever be useful to have a {{final!}} that doesn't mutate the
262context?
263
Note: See TracBrowser for help on using the repository browser.