Opened 5 years ago

Last modified 10 months ago

#1374 new defect

`display' issue with UTF-8

Reported by: Mario Domenech Goulart
Owned by:
Priority: major
Milestone: 5.4
Component: core libraries
Version: 4.12.0
Keywords: display, ##sys#print, utf8
Cc:
Estimated difficulty: hard


I received a bug report against awful, but the issue seems to be related to CHICKEN.

Here's a smaller test case to illustrate the problem:

$ cat test.scm

(cond-expand
  (chicken (use utf8))
  (else #f))

(let ((chars (string->list "出")))
  (display "<html><head><meta charset=\"utf-8\"/></head>")
  (display chars)
  (display "<br>")
  (display "(")
  (display (car chars))
  (display ")")
  (display "</html>"))

To see the problem:

$ csi -s test.scm > chicken-out.html
$ firefox chicken-out.html

It seems that display garbles the output when printing the list that contains the UTF-8 character.

Gauche does the right thing:

$ gosh test.scm > gauche-out.html
$ firefox gauche-out.html

The two output files differ, of course:

$ cmp gauche-out.html chicken-out.html
gauche-out.html chicken-out.html differ: byte 44, line 1
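
To see which bytes actually differ at the reported offset, a hex dump of both files can help. The following is only a sketch: `dump_at` is a hypothetical helper name, it assumes a POSIX `od`, and the reference bytes shown are the standard UTF-8 encoding of 出 (U+51FA), not output captured from either implementation.

```shell
# Print COUNT bytes of FILE in hex, starting at the 1-based byte OFFSET
# (the same numbering that cmp reports).
# Usage: dump_at FILE OFFSET COUNT
dump_at() {
  od -An -tx1 -j "$(( $2 - 1 ))" -N "$3" "$1"
}

# Reference: the UTF-8 encoding of 出 (U+51FA), i.e. bytes e5 87 ba.
# Octal escapes are used because \x escapes are not portable in POSIX printf.
printf '\345\207\272' | od -An -tx1

# Usage against the files from this ticket:
#   dump_at gauche-out.html 44 3
#   dump_at chicken-out.html 44 3
```

`cmp -l gauche-out.html chicken-out.html` gives a similar view, listing every differing byte position with both values in octal.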

Change History (4)

comment:1 Changed 5 years ago by Kooda

That’s just because CHICKEN strings are byte strings, not utf-8 strings.

(use utf8) at the top of the file should solve the issue here.

EDIT: whoa, my eyes can’t see. Ignore this.

Last edited 5 years ago by Kooda

comment:2 Changed 5 years ago by sjamaan

Estimated difficulty: set to hard

comment:3 Changed 10 months ago by sjamaan

Milestone: changed from someday to 5.4

The problem seems to be that the utf8 egg redefines display as a procedure which special-cases characters but hands off the display of nested structures to the built-in display, which then garbles any wide characters inside them.

The built-in display uses outchr, which calls the port's write-char procedure; for regular file-based ports that is defined as C_display_char, which uses C_fputc.

Perhaps one of these should be changed analogously to ##sys#char->utf8-string so that we're not calling putc directly on wide characters? I think it might be good to do that at the lowest level possible (i.e., either C_display_char or write-char).
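
For illustration, the difference between a single putc-style write and a UTF-8 encoding step can be worked out with plain shell arithmetic. This is a rough sketch, not CHICKEN code: it assumes the full code point reaches fputc unchanged (fputc converts its argument to unsigned char, so only the low 8 bits would be written), and `low_byte` and `utf8_3byte` are hypothetical helper names.

```shell
# Code point of 出 (U+51FA), the character from the test case.
cp=$((0x51FA))

# What a single-byte write keeps: only the low 8 bits of the code point.
low_byte() {
  printf '%02x\n' "$(( $1 & 0xFF ))"
}

# Three-byte UTF-8 encoding; covers code points U+0800..U+FFFF only.
utf8_3byte() {
  printf '%02x %02x %02x\n' \
    "$(( 0xE0 | ($1 >> 12) ))" \
    "$(( 0x80 | (($1 >> 6) & 0x3F) ))" \
    "$(( 0x80 | ($1 & 0x3F) ))"
}

low_byte "$cp"     # prints: fa
utf8_3byte "$cp"   # prints: e5 87 ba
```

Under that assumption, the list output would contain the single byte fa where a UTF-8 aware writer emits e5 87 ba, which would be consistent with the files first differing inside the displayed list.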

comment:4 Changed 10 months ago by sjamaan

Hmm, on second thought, that would break writing of raw bytes or latin1. Perhaps this is better solved by the utf8 egg wrapping the port in a custom port that calls the underlying port's write-char in the described way, and then handing that off to the built-in display?

Last edited 10 months ago by sjamaan