Opened 5 years ago

Last modified 8 days ago

#1374 new defect

`display' issue with UTF-8

Reported by: Mario Domenech Goulart
Owned by:
Priority: major
Milestone: 5.4
Component: core libraries
Version: 4.12.0
Keywords: display, ##sys#print, utf8
Cc:
Estimated difficulty: hard

Description

I received a bug report against awful (https://github.com/mario-goulart/awful/issues/5), but the issue seems to be related to CHICKEN.

Here's a smaller test case to illustrate the problem:

$ cat test.scm

(cond-expand
  (chicken (use utf8))
  (else #f))

(let ((chars (string->list "出")))
  (display "<html><head><meta charset=\"utf-8\"/></head>")
  (display chars)
  (display "<br>")
  (display "(")
  (display (car chars))
  (display ")")
  (display "</html>"))

To see the problem:

$ csi -s test.scm > chicken-out.html
$ firefox chicken-out.html

It seems that display produces garbled output when printing the list that contains the UTF-8 character, while displaying the character on its own works.

Gauche does the right thing:

$ gosh test.scm > gauche-out.html
$ firefox gauche-out.html

The two output files differ, of course:

$ cmp gauche-out.html chicken-out.html
gauche-out.html chicken-out.html differ: byte 44, line 1

Change History (4)

comment:1 Changed 4 years ago by Kooda

That’s just because CHICKEN strings are byte strings, not utf-8 strings.

(use utf8) at the top of the file should solve the issue here.

EDIT: whoa, my eyes can’t see. Ignore this.

Last edited 4 years ago by Kooda

comment:2 Changed 4 years ago by sjamaan

Estimated difficulty: hard

comment:3 Changed 8 days ago by sjamaan

Milestone: someday → 5.4

The problem seems to be that the utf8 egg redefines display as a procedure which special-cases characters, but hands off the displaying of nested structures to the built-in display, which then mishandles the multi-byte characters inside them.

The built-in display uses outchr which calls the port's write-char procedure, which for regular file-based ports is defined as C_display_char which uses C_fputc.

Perhaps one of these should be changed analogously to ##sys#char->utf8-string so that we're not calling putc directly on wide characters? I think it would be good to do that at the lowest level possible (i.e., either C_display_char or write-char).
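To illustrate the difference, here is a minimal C sketch that contrasts what a putc-style call keeps of a wide character with a proper UTF-8 encoding of the same code point. It is not CHICKEN's actual code: encode_utf8 and putc_byte are hypothetical helper names made up for this example, and the sketch only relies on the standard behaviour that fputc converts its int argument to unsigned char.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of CHICKEN): encode one Unicode code point
   as UTF-8 into buf, which must hold at least 4 bytes. Returns the number
   of bytes written, or 0 if cp is a surrogate or beyond U+10FFFF. */
size_t encode_utf8(uint32_t cp, unsigned char buf[4])
{
    if (cp < 0x80) {                       /* 1 byte: plain ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* out of Unicode range */
}

/* Hypothetical helper: the single byte that survives when a code point is
   passed straight to fputc, which converts its argument to unsigned char. */
unsigned char putc_byte(uint32_t cp)
{
    return (unsigned char)cp;
}
```

For the character in the test case, U+51FA, encode_utf8 yields the three bytes E5 87 BA, while putc_byte keeps only the low byte FA.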

comment:4 Changed 8 days ago by sjamaan

Hmm, on second thought, that would break writing of raw bytes or latin1. Perhaps this is better solved by the utf8 egg overloading the port with a custom port that calls the underlying port's write-char in the described way, and then handing that off to the built-in display?
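The shape of that idea can be modelled in C as a rough sketch. None of this is CHICKEN's real port API: byte_port, utf8_port, mem_sink and the functions below are hypothetical names invented for this example. The point is only that the underlying port keeps writing raw bytes untouched, while the wrapper accepts code points and forwards their UTF-8 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an underlying port: it writes one raw byte at a
   time and knows nothing about character encodings. */
typedef struct byte_port {
    void (*write_byte)(struct byte_port *self, unsigned char b);
    void *state;
} byte_port;

/* Hypothetical wrapper port: takes code points and forwards their UTF-8
   bytes to the wrapped port, leaving the wrapped port itself unchanged. */
typedef struct {
    byte_port *inner;
} utf8_port;

/* Write one code point through the wrapper. Surrogates and values beyond
   U+10FFFF are silently skipped in this sketch. */
void utf8_port_write_char(utf8_port *p, uint32_t cp)
{
    byte_port *out = p->inner;
    if (cp < 0x80) {
        out->write_byte(out, (unsigned char)cp);
    } else if (cp < 0x800) {
        out->write_byte(out, (unsigned char)(0xC0 | (cp >> 6)));
        out->write_byte(out, (unsigned char)(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return;
        out->write_byte(out, (unsigned char)(0xE0 | (cp >> 12)));
        out->write_byte(out, (unsigned char)(0x80 | ((cp >> 6) & 0x3F)));
        out->write_byte(out, (unsigned char)(0x80 | (cp & 0x3F)));
    } else if (cp <= 0x10FFFF) {
        out->write_byte(out, (unsigned char)(0xF0 | (cp >> 18)));
        out->write_byte(out, (unsigned char)(0x80 | ((cp >> 12) & 0x3F)));
        out->write_byte(out, (unsigned char)(0x80 | ((cp >> 6) & 0x3F)));
        out->write_byte(out, (unsigned char)(0x80 | (cp & 0x3F)));
    }
}

/* Hypothetical in-memory sink, standing in for a file-based port. */
typedef struct {
    unsigned char data[16];
    size_t len;
} mem_sink;

/* write_byte callback for mem_sink; bytes past the buffer are dropped. */
void mem_write_byte(byte_port *self, unsigned char b)
{
    mem_sink *s = (mem_sink *)self->state;
    if (s->len < sizeof s->data)
        s->data[s->len++] = b;
}
```

With this split, code that writes raw bytes or latin1 keeps calling the underlying port's write_byte directly, and only callers that opt in to the wrapper get UTF-8 encoding.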

Last edited 8 days ago by sjamaan