Opened 13 years ago
Closed 4 years ago
#768 closed defect (fixed)
s11n: reconstruction of hash-tables incorrect on de-serialization
Reported by: | felix winkelmann | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | extensions | Version: | 4.7.x |
Keywords: | s11n | Cc: | |
Estimated difficulty: | hard |
Description
As pointed out by sjamaan, the reconstruction of hash-tables on deserialization is incorrect and creates a hash-table record that is missing several slots.
Change History (11)
comment:1 Changed 8 years ago by
Estimated difficulty: | → hard |
---|
comment:2 Changed 6 years ago by
comment:4 Changed 4 years ago by
(Note to self as much as anything else)
The fact that the order is sometimes different isn't weird, running the same script without the round trip through the file exhibits that behavior as well. The problem is the missing slot.
Doing some more tests:
- running the same script but with multiple calls to deserialization of the same file sometimes changes the order even within the same call of the script, but the amount of slots present is the same for each file.
- running the same script but with multiple calls to hash-table->alist from the same deserialization always gives the same output (for that given deserialization).
If I ran the script millions of times perhaps I'd see some swans of a different color, but from what I can see this tells me that the order of the slots can change during the deserialization, while the absense of slots is set in the serialization.
I have a "ht" file (chicken-dump serialization.s11n > ht # on a run that didn't have (foo . bar)) and a "ht-foo-bar" file (chicken-dump serialization.s11n > ht # on a run that did). Looking at the difference between both files through a diff viewer like meld shows me that in the version of serialization.s11n that lost the (foo . bar), it *is* in there, it's just that those symbols are stored as back-references to within the hash-table. Remember that symbols share their data in memory, unlike strings.
comment:5 Changed 4 years ago by
I've also captured a file that exhibited the (foo . bar) (ht . #<hash-table (1)>))
behavior, but looking inside it was exactly the same as the other order.
In other words: ht foo-bar vs foo-bar ht has the same serialization, the order is different when creating the hash-table during deserialization.
Whereas the serializations that only exhibit ht does have foo-bar stored in them, but it is as back-references to the inner hash-table.
comment:6 Changed 4 years ago by
Hash-tables are stored as alists preceded by a special tag (which is implemented as a specific byte, decimal 15 referred to as hash-table-tag).
comment:7 Changed 4 years ago by
OK, so the "inner refence" thing was a complete red herring because even when the hash tables use different symbols, the same problem can happen. Good to know! Investigation continues.
comment:8 Changed 4 years ago by
I have fixed the bug: https://ellen.idiomdrottning.org/s11n
I've changed deserialize
and chicken-dump
while I've left serialize
alone.
Here is what was wrong:
srfi-69 hash-tables in chicken 5 have eleven slots, more than I guess previous versions must've had because the deserializer stops fetching after just fetching the test & the hash function. Leading the *next* fetcher to fetch things that were actually emitted as part of the hash-table. It was not just hash-tables in hash-tables that was borked, it was hash-tables in lists too.
Slot 5 is the min-load value but the deserializer believes that it is about to read the cdr of the linked-list that the hash-table is inside. So it sets the cdr to 0.5 or whatever, terminating the list prematurely, losing the other values that was in the list.
(list 'blue 'jeans ht1 'white 't-shirt)
was also borked, it'd lose the white t-shirt because it thought cdr was 0.5 instead of the remaining links to the list.
comment:9 Changed 4 years ago by
About backwards compatibility: deserialize and chicken-dump in my version now expect eleven slots whereas they used to expect five. "So why not change the tag to new-hash-table-tag
or similar?"
The problem is that serialize itself was (and still is) implemented as "emit all the slots of hash-table", not just the first five. So for many many years, serialize has emitted eleven-slot hash-tables already while deserialize/chicken-dump has tried to snarf only five. Backwards compatibility was *already* broken.
comment:11 Changed 4 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
I've applied the patch and tagged the egg as 0.9.12 - many thanks!
Here's a simple example to illustrate the problem:
Maybe I was not creative enough, but I could only reproduce this behavior with nested hash-tables.