Opened 3 months ago

Last modified 25 hours ago

#1851 new defect

utf8 egg: Missing char sets and outdated tables

Reported by: Zipheir
Owned by:
Priority: minor
Milestone: someday
Component: unknown
Version: 5.4.0
Keywords: unicode
Cc:
Estimated difficulty:

Description

The unicode-char-sets module of the utf8 egg is missing several character sets. In particular, it has no set for characters with the Numeric property (which makes it impossible to implement a Unicode-aware 'char-numeric?' in CHICKEN) and no sets for any of the punctuation properties. The utf8-srfi-14 module does include char-set:digit and char-set:punctuation, but these are throwaway ASCII-only implementations (in a file that begins with "Unicode capable char-sets", no less!). The missing Unicode sets should be added.
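To illustrate the gap between an ASCII-only digit test and a Unicode-aware numeric test, here is a minimal sketch. It is written in Python rather than CHICKEN, purely because Python's standard unicodedata module makes the distinction easy to show. It assumes "the Numeric property" means "has a numeric value in the Unicode Character Database". The helper names is_ascii_digit and is_unicode_numeric are hypothetical and are not part of the utf8 egg.

```python
import unicodedata

def is_ascii_digit(ch: str) -> bool:
    """ASCII-only test, comparable in scope to an ASCII-only digit set."""
    return "0" <= ch <= "9"

def is_unicode_numeric(ch: str) -> bool:
    """True if the character has any numeric value in the Unicode database."""
    return unicodedata.numeric(ch, None) is not None

# DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF,
# ROMAN NUMERAL EIGHT, and a plain letter for contrast.
samples = ["7", "\u0663", "\u00bd", "\u2167", "x"]
for ch in samples:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          is_ascii_digit(ch), is_unicode_numeric(ch))
```

Only the first sample passes the ASCII test, while the first four all carry a numeric value. A Unicode-aware 'char-numeric?' needs a set covering characters like these.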

Furthermore, the sets that unicode-char-sets does provide seem to be built on badly out-of-date data: the header comment in unicode-char-sets.scm says the tables were generated in 2007.

Attachments (4)

generate-sets.sh (181 bytes) - added by zaifir 25 hours ago.
Trivial driver script (run by custom-build).
generate-sets.scm (12.0 KB) - added by zaifir 25 hours ago.
Script to generate Unicode char set modules from current UCD data.
unicode-char-sets.scm (2.7 KB) - added by zaifir 24 hours ago.
Catch-all unicode-char-sets module.
utf8_egg.diff (3.5 KB) - added by zaifir 24 hours ago.
Changes to utf8 egg file.

Change History (6)

comment:1 Changed 6 days ago by zaifir

I'm working on an updated version of the utf8 egg that fetches the official Unicode tables and generates the character sets from them. In the process, I've learned a lot more about the complexities of the Unicode property architecture.
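For readers unfamiliar with the table format, the following sketch shows one way range lines in the UCD property files could be turned into code point ranges. It is not the attached generate-sets.scm, whose internals are not shown in this ticket. It is in Python for illustration, the sample lines are written in the style of PropList.txt, and parse_ranges is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Illustrative lines in the UCD "code point(s) ; property # comment" layout.
SAMPLE = """\
# comment lines and blank lines are ignored

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0030..0039    ; Hex_Digit # Nd  [10] DIGIT ZERO..DIGIT NINE
"""

# One code point or a "start..end" range, then ";" and the property name.
LINE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")

def parse_ranges(text):
    """Map each property name to a list of inclusive (start, end) ranges."""
    sets = defaultdict(list)
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        m = LINE.match(line)
        if m is None:
            continue  # skip anything that is not a range line
        start = int(m.group(1), 16)
        end = int(m.group(2), 16) if m.group(2) else start
        sets[m.group(3)].append((start, end))
    return dict(sets)

ranges = parse_ranges(SAMPLE)
print(ranges)
```

Keeping ranges rather than individual code points matters for the larger properties, since a generated set can then be built from a short list of intervals.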

While I think char-set:numeric is still absolutely necessary & should be added, I now believe that the egg authors did the right thing in keeping char-set:digit, char-set:punctuation, & the rest of the old SRFI 14 sets ASCII-only. The naming is a bit confusing, however, since some utf8-srfi-14 sets have names very similar to those in unicode-char-sets, e.g. char-set:whitespace & char-set:white-space.

Since some Unicode char sets are quite large, I propose splitting (unicode-char-sets) into submodules, each containing one set. For example, char-set:arabic should be provided by (unicode-char-sets arabic). I'm implementing this in my new version of the egg.
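The proposed one-set-per-submodule layout implies a mapping from Unicode property or script names to module and set names. The sketch below shows one plausible naming rule. The rule itself (lowercase, underscores to hyphens) is an assumption inferred from the two examples in this ticket, char-set:arabic and char-set:white-space. The helper names are hypothetical, and Python is used only for illustration.

```python
def _slug(prop: str) -> str:
    """Assumed rule: lowercase the UCD name and turn underscores into hyphens."""
    return prop.lower().replace("_", "-")

def set_name(prop: str) -> str:
    """Hypothetical set name, e.g. 'White_Space' -> 'char-set:white-space'."""
    return "char-set:" + _slug(prop)

def module_name(prop: str) -> str:
    """Hypothetical submodule name, e.g. 'Arabic' -> '(unicode-char-sets arabic)'."""
    return "(unicode-char-sets " + _slug(prop) + ")"

for prop in ["Arabic", "White_Space"]:
    print(module_name(prop), "provides", set_name(prop))
```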

More soon.

Changed 25 hours ago by zaifir

Attachment: generate-sets.sh added

Trivial driver script (run by custom-build).

comment:2 Changed 25 hours ago by zaifir

I've attached my set-module generation script & a patch for utf8.egg.

Changed 25 hours ago by zaifir

Attachment: generate-sets.scm added

Script to generate Unicode char set modules from current UCD data.

Changed 24 hours ago by zaifir

Attachment: unicode-char-sets.scm added

Catch-all unicode-char-sets module.

Changed 24 hours ago by zaifir

Attachment: utf8_egg.diff added

Changes to utf8 egg file.
