Opened 14 years ago

Closed 13 years ago

#325 closed enhancement (fixed)

ssax and Unicode entities

Reported by: Ivan Raikov
Owned by: Jim Ursetto
Priority: major
Milestone:
Component: extensions
Version: 4.5.x
Keywords: ssax unicode entity
Cc:
Estimated difficulty:

Description

The SSAX library includes a file, html-entity-codes.scm, that defines a table of Unicode entity codes. This table could be used to replace ssax:predefined-parsed-entities, so that XHTML files that include Unicode entity codes can be parsed with SSAX. However, the Chicken sxml-transforms library does not export these definitions.

The code that constructs the table of Unicode entity codes assumes a Unicode-aware integer->char, so it should not be included in sxml-transforms by default. Perhaps we need a separate egg, something like utf8-sxml-transforms, that utilizes the utf8 library?

Change History (6)

comment:1 Changed 14 years ago by Jim Ursetto

Milestone: 4.6.0

Ivan,

integer->char is already unicode aware.

$ csi -e '(write (integer->char #x3bb))'
\u3bb

What is not unicode aware is converting this character to a string using (string c) or (make-string 1 c). However, you can use the system procedure ##sys#char->utf8-string:

$ csi -p '(##sys#char->utf8-string (integer->char #x3bb))'
λ

It is trivial to modify html-entity-unicode-chars to call this instead of core make-string, which will not require the utf8 egg. In fact I did a similar thing in the 'ssax' egg so that numeric entities are parsed into utf8 sequences without requiring the utf8 egg.
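
For reference, the conversion described above amounts to UTF-8 encoding a code point. The following is only a rough sketch in portable Scheme of what bytes such a conversion has to produce; the procedure name code-point->utf8-bytes is hypothetical and is not part of ssax, sxml-transforms, or CHICKEN, and it does not show how ##sys#char->utf8-string is actually implemented.

```scheme
;; Illustrative sketch only: encode a Unicode code point as a list of
;; UTF-8 byte values using plain integer arithmetic.
;; code-point->utf8-bytes is a hypothetical name, not an egg or core API.
(define (code-point->utf8-bytes cp)
  ;; Build a continuation byte (10xxxxxx) from the 6 bits of cp that
  ;; remain after dropping `shift` low-order bits.
  (define (cont shift)
    (+ #x80 (modulo (quotient cp (expt 2 shift)) 64)))
  (cond ((< cp #x80)                      ; 1 byte:  0xxxxxxx
         (list cp))
        ((< cp #x800)                     ; 2 bytes: 110xxxxx 10xxxxxx
         (list (+ #xC0 (quotient cp 64)) (cont 0)))
        ((< cp #x10000)                   ; 3 bytes: 1110xxxx + 2 continuation bytes
         (list (+ #xE0 (quotient cp 4096)) (cont 6) (cont 0)))
        (else                             ; 4 bytes: 11110xxx + 3 continuation bytes
         (list (+ #xF0 (quotient cp 262144)) (cont 12) (cont 6) (cont 0)))))

(code-point->utf8-bytes #x3bb)   ; => (206 187), i.e. bytes CE BB for λ
(code-point->utf8-bytes #x2026)  ; => (226 128 166), i.e. bytes E2 80 A6 for …
```

A string built with (string c) or (make-string 1 c) would need to contain exactly these bytes for the entity to print correctly in a UTF-8 terminal.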

Speaking of which, shouldn't this be filed against the ssax egg and not against sxml-transforms?

comment:2 Changed 14 years ago by sjamaan

Summary: changed from "sxml-transforms and Unicode entities" to "ssax and Unicode entities"

I think that's right. ssax is about parsing, sxml-transforms is about transformation of already-parsed sxml.

I'll change the title to match. I don't know who maintains ssax, but I don't.

comment:3 Changed 14 years ago by Jim Ursetto

Keywords: ssax added; sxml-transforms removed
Owner: changed from sjamaan to Jim Ursetto
Status: changed from new to assigned

No one really maintains it, but I can make this change.

Ivan, do you just want me to export html-entity-unicode-chars from ssax, after fixing it to handle utf8? I don't know how to use the low-level SSAX parser, so if you could give me a short test case, that would be appreciated.

comment:4 in reply to:  3 Changed 14 years ago by Ivan Raikov

Yes, that would be fine. I also have not used the low level parser, but I will look at it and see if I can make it work. Any XML file that includes a Unicode entity (e.g. hellip) would be a good test case.
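
A minimal input of the kind described might look like the sketch below. The variable names are hypothetical, and no parser call is shown because the thread does not establish how the entity table is passed to SSAX.

```scheme
;; Hypothetical test inputs; the names are illustrative only.

;; An XHTML fragment that uses the named entity &hellip; (U+2026).
;; Parsing it requires an entity table that knows about hellip.
(define test-document-named
  "<p>To be continued&hellip;</p>")

;; The same text with a numeric character reference, which does not
;; depend on any entity table.
(define test-document-numeric
  "<p>To be continued&#x2026;</p>")
```

Both inputs should yield the same text once named entities are resolved, with the ellipsis appearing as the UTF-8 sequence for U+2026.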

comment:5 Changed 13 years ago by sjamaan

Isn't this fixed now? Can this ticket be closed?

comment:6 Changed 13 years ago by Jim Ursetto

Resolution: fixed
Status: changed from assigned to closed

Added html-entity-unicode-chars in version 5.0.5. I have absolutely no idea how to use it --- you can't pass it to ssax:xml->sxml --- but at least it has the right value!
