ssax and Unicode entities

The SSAX library includes a file, html-entity-codes.scm, that defines a table of Unicode entity codes. This table could be used to replace ssax:predefined-parsed-entities, so that XHTML files that include Unicode entity codes can be parsed with SSAX. However, the Chicken sxml-transforms library does not export these definitions.

The code that constructs the table of Unicode entity codes assumes a Unicode-aware integer->char, so it should not be included in sxml-transforms by default. Perhaps we need a separate egg, something like utf8-sxml-transforms, that uses the utf8 library?

integer->char is already Unicode-aware:

$ csi -e '(write (integer->char #x3bb))'

What is not Unicode-aware is converting this character to a string using (string c) or (make-string 1 c). However, you can use the system procedure ##sys#char->utf8-string:

$ csi -p '(##sys#char->utf8-string (integer->char #x3bb))'

It is trivial to modify html-entity-unicode-chars to call this instead of the core make-string, and doing so would not require the utf8 egg. In fact, I did a similar thing in the 'ssax' egg so that numeric entities are parsed into UTF-8 sequences without requiring the utf8 egg.
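
A minimal sketch of that change, assuming the entity table is an association list of (name . code-point) pairs. The names sample-entity-codes and entity-codes->utf8-strings are hypothetical and are not taken from the SSAX sources; only ##sys#char->utf8-string and integer->char come from the discussion above.

```scheme
;; Hypothetical three-entry sample standing in for the full table
;; in html-entity-codes.scm; each entry is (entity-name . code-point).
(define sample-entity-codes
  '((hellip . #x2026)
    (lambda . #x3bb)
    (nbsp   . #xa0)))

;; Convert every code point to a string with the system procedure
;; ##sys#char->utf8-string rather than (make-string 1 c), so the
;; utf8 egg is not needed.
(define (entity-codes->utf8-strings codes)
  (map (lambda (entry)
         (cons (car entry)
               (##sys#char->utf8-string (integer->char (cdr entry)))))
       codes))

;; Resulting alist: (entity-name . utf8-string)
(define sample-entity-strings
  (entity-codes->utf8-strings sample-entity-codes))
```

Looking up an entity is then an ordinary assq call, for example (cdr (assq 'hellip sample-entity-strings)).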

Speaking of which, shouldn't this be filed against the ssax egg and not against sxml-transforms?

I think that's right. ssax is about parsing, sxml-transforms is about transformation of already-parsed sxml.

I'll change the title to match. I don't know who maintains ssax, but I don't.

No one really maintains it, but I can make this change.

Ivan, do you just want me to export html-entity-unicode-chars from ssax, after fixing it to handle UTF-8? I don't know how to use the low-level SSAX parser, so if you could give me a short test case, that would be appreciated.

comment:4 in reply to:  3 Changed 15 years ago by Ivan Raikov

Yes, that would be fine. I have not used the low-level parser either, but I will look at it and see if I can make it work. Any XML file that includes a Unicode entity (e.g. hellip) would be a good test case.

Isn't this fixed now? Can this ticket be closed?

Added html-entity-unicode-chars in version 5.0.5. I have absolutely no idea how to use it (you can't pass it to ssax:xml->sxml), but at least it has the right value!

