Opened 4 months ago

Last modified 4 weeks ago

#1851 new defect

utf8 egg: Missing char sets and outdated tables

Reported by: Zipheir
Owned by:
Priority: minor
Milestone: someday
Component: unknown
Version: 5.4.0
Keywords: unicode
Cc:
Estimated difficulty:

Description

The unicode-char-sets module of the utf8 egg is missing several character sets. In particular, there is no set for characters with the Numeric property (making it impossible to implement a Unicode-aware 'char-numeric?' in CHICKEN), nor for any of the punctuation properties. The utf8-srfi-14 module does include char-set:digit and char-set:punctuation, but these are throwaway ASCII-only implementations (in a file that begins with "Unicode capable char-sets", no less!). The missing sets should be added.
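To make the gap concrete, here is a minimal sketch in Python (not CHICKEN) that contrasts an ASCII-only digit test with a Unicode-aware one. It uses only the standard `unicodedata` module. The function names are hypothetical, and treating "numeric" as general category Nd, Nl, or No is an assumption about what the ticket means by the Numeric property, not something the utf8 egg defines.

```python
import unicodedata

# ASCII-only digit set, analogous to the current char-set:digit.
ASCII_DIGITS = frozenset("0123456789")


def ascii_digit(ch):
    """Return True only for the ten ASCII digits."""
    return ch in ASCII_DIGITS


def unicode_numeric(ch):
    """Return True for any character in a Number general category.

    Assumption: "numeric" here means general category Nd (decimal digit),
    Nl (letter number), or No (other number).
    """
    return unicodedata.category(ch) in ("Nd", "Nl", "No")


# ASCII digit, Arabic-Indic digit three, Roman numeral eight,
# vulgar fraction one half, and a plain letter.
samples = ["7", "\u0663", "\u2167", "\u00bd", "x"]
for ch in samples:
    print(f"U+{ord(ch):04X} ascii={ascii_digit(ch)} unicode={unicode_numeric(ch)}")
```

Only the first sample passes the ASCII test, while the first four all count as numeric under the Unicode-aware test. That difference is what a char-set for the Numeric property would need to capture.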

Furthermore, the sets that unicode-char-sets does provide seem to be built on data that is extremely out of date. The header comment in unicode-char-sets.scm claims the tables were generated in 2007.

Attachments (4)

generate-sets.sh (181 bytes) - added by zaifir 4 weeks ago.
Trivial driver script (run by custom-build).
generate-sets.scm (12.0 KB) - added by zaifir 4 weeks ago.
Script to generate Unicode char set modules from current UCD data.
unicode-char-sets.scm (2.7 KB) - added by zaifir 4 weeks ago.
Catch-all unicode-char-sets module.
utf8_egg.diff (7.6 KB) - added by zaifir 4 weeks ago.
Changes to utf8 egg file.

Change History (7)

comment:1 Changed 5 weeks ago by zaifir

I'm working on an updated version of the utf8 egg which fetches & generates the Unicode character sets from the official tables. In the process, I've learned a lot more about the complexities of the Unicode property architecture.

While I think char-set:numeric is still absolutely necessary & should be added, I now believe that the egg authors did the right thing in not extending char-set:digit, char-set:punctuation, & the rest of the old SRFI 14 sets beyond ASCII. This is a bit confusing, however, since the names of some utf8-srfi-14 sets are very similar to those in unicode-char-sets, e.g. char-set:whitespace & char-set:white-space.

Since some Unicode char sets are quite large, I propose splitting (unicode-char-sets) into submodules, each containing one set. For example, char-set:arabic should be provided by (unicode-char-sets arabic). I'm implementing this in my new version of the egg.
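As a rough illustration of the generation step, the following Python sketch parses lines in the "range ; value # comment" layout used by UCD data files and groups the code point ranges by property value, so that each group could be written out as its own module. This is not the attached generate-sets.scm. The function names are hypothetical, and the sample lines are hand-written for illustration rather than copied from a UCD release.

```python
from collections import defaultdict

# Hand-written sample in the UCD "range ; value # comment" layout.
SAMPLE = """\
# Illustrative excerpt only, not real UCD file contents.
0030..0039    ; Nd # ASCII digits
0660..0669    ; Nd # Arabic-Indic digits
00BD          ; No # vulgar fraction one half
2160..216B    ; Nl # Roman numerals one to twelve
"""


def parse_ranges(text):
    """Group inclusive (start, end) code point ranges by property value."""
    sets = defaultdict(list)
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not data:
            continue
        field, value = (part.strip() for part in data.split(";", 1))
        lo, _, hi = field.partition("..")
        start = int(lo, 16)
        end = int(hi, 16) if hi else start  # single code point line
        sets[value].append((start, end))
    return dict(sets)


def contains(ranges, ch):
    """Return True if the character falls inside any of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


sets = parse_ranges(SAMPLE)
for name, ranges in sorted(sets.items()):
    # One group per property value, mirroring one set per submodule.
    print(name, [(hex(lo), hex(hi)) for lo, hi in ranges])
```

Keeping each property value in its own group means a caller that only needs one set never has to load the ranges for the others, which is the motivation given above for splitting the module.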

More soon.

Changed 4 weeks ago by zaifir

Attachment: generate-sets.sh added

Trivial driver script (run by custom-build).

comment:2 Changed 4 weeks ago by zaifir

I've attached my set-module generation script & a patch for utf8.egg.

Changed 4 weeks ago by zaifir

Attachment: generate-sets.scm added

Script to generate Unicode char set modules from current UCD data.

Changed 4 weeks ago by zaifir

Attachment: unicode-char-sets.scm added

Catch-all unicode-char-sets module.

Changed 4 weeks ago by zaifir

Attachment: utf8_egg.diff added

Changes to utf8 egg file.

comment:3 Changed 4 weeks ago by zaifir

Per Pietro Cerutti's comments on chicken-users, I've changed the egg file to build the set extensions normally, rather than have it call generate-sets as a custom build script. I expect the egg maintainer to run generate-sets periodically to regenerate the extension files.

The egg also defines an extension for each character set. Redirect complaints about this to Felix; I can't get chicken-install to build the egg correctly with any other configuration.