Opened 12 years ago

Closed 4 years ago

#768 closed defect (fixed)

s11n: reconstruction of hash-tables incorrect on de-serialization

Reported by: felix winkelmann Owned by:
Priority: critical Milestone:
Component: extensions Version: 4.7.x
Keywords: s11n Cc:
Estimated difficulty: hard

Description

As pointed out by sjamaan, the reconstruction of hash-tables on deserialization is incorrect and creates a hash-table record that is missing several slots.

Change History (11)

comment:1 Changed 8 years ago by sjamaan

Estimated difficulty: hard

comment:2 Changed 5 years ago by Mario Domenech Goulart

Here's a simple example to illustrate the problem:

$ cat s11n-ht.scm
(use srfi-69)
(use s11n)

(define ht1
  (alist->hash-table '((foo . bar))))

(define ht2
  (alist->hash-table `((foo . bar)
                       (ht . ,ht1))))

(with-output-to-file "serialized.s11n"
  (lambda ()
    (serialize ht2)))

(pp (hash-table->alist
     (with-input-from-file "serialized.s11n" deserialize)))
$ csi -s s11n-ht.scm
((foo . bar) (ht . #<hash-table (1)>))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>) (foo . bar))

$ csi -s s11n-ht.scm
((foo . bar) (ht . #<hash-table (1)>))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>) (foo . bar))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>) (foo . bar))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>) (foo . bar))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>))

$ csi -s s11n-ht.scm
((ht . #<hash-table (1)>))

$ csi -s s11n-ht.scm
((foo . bar) (ht . #<hash-table (1)>))

Maybe I was not creative enough, but I could only reproduce this behavior with nested hash-tables.

comment:3 Changed 4 years ago by megane

See also segfaulting example in #1692

comment:4 Changed 4 years ago by Idiomdrottning

(Note to self as much as anything else)

The fact that the order is sometimes different isn't weird, running the same script without the round trip through the file exhibits that behavior as well. The problem is the missing slot.

Version 0, edited 4 years ago by Idiomdrottning (next)

comment:5 Changed 4 years ago by Idiomdrottning

I've also captured a file that exhibited the (foo . bar) (ht . #<hash-table (1)>)) behavior, but looking inside it was exactly the same as the other order.

In other words: ht foo-bar vs foo-bar ht has the same serialization, the order-difference arises when creating the hash-table during deserialization. Which is fine.

Whereas the serializations that only exhibit ht does have foo-bar stored in them, but it is as back-references to the inner hash-table. These files can't be deserialized properly no matter how often I try. Whether this is a bug in serialization or deserialization is TBD.

Last edited 4 years ago by Idiomdrottning (previous) (diff)

comment:6 Changed 4 years ago by Idiomdrottning

Hash-tables are stored as alists preceded by a special tag (which is implemented as a specific byte, decimal 15 referred to as hash-table-tag).

comment:7 Changed 4 years ago by Idiomdrottning

OK, so the "inner refence" thing was a complete red herring because even when the hash tables use different symbols, the same problem can happen. Good to know! Investigation continues.

comment:8 Changed 4 years ago by Idiomdrottning

I have fixed the bug: https://ellen.idiomdrottning.org/s11n

I've changed deserialize and chicken-dump while I've left serialize alone.

Here is what was wrong:
srfi-69 hash-tables in chicken 5 have eleven slots, more than I guess previous versions must've had because the deserializer stops fetching after just fetching the test & the hash function. Leading the *next* fetcher to fetch things that were actually emitted as part of the hash-table. It was not just hash-tables in hash-tables that was borked, it was hash-tables in lists too.

Slot 5 is the min-load value but the deserializer believes that it is about to read the cdr of the linked-list that the hash-table is inside. So it sets the cdr to 0.5 or whatever, terminating the list prematurely, losing the other values that was in the list.

(list 'blue 'jeans ht1 'white 't-shirt) was also borked, it'd lose the white t-shirt because it thought cdr was 0.5 instead of the remaining links to the list.

comment:9 Changed 4 years ago by Idiomdrottning

About backwards compatibility: deserialize and chicken-dump in my version now expect eleven slots whereas they used to expect five. "So why not change the tag to new-hash-table-tag or similar?"

The problem is that serialize itself was (and still is) implemented as "emit all the slots of hash-table", not just the first five. So for many many years, serialize has emitted eleven-slot hash-tables already while deserialize/chicken-dump has tried to snarf only five. Backwards compatibility was *already* broken.

comment:10 Changed 4 years ago by Idiomdrottning

uploaded to less flaky server: https://idiomdrottning.org/s11n/

comment:11 Changed 4 years ago by felix winkelmann

Resolution: fixed
Status: newclosed

I've applied the patch and tagged the egg as 0.9.12 - many thanks!

Note: See TracTickets for help on using tickets.