Opened 3 months ago
Last modified 25 hours ago
#1851 new defect
utf8 egg: Missing char sets and outdated tables
Reported by: | Zipheir | Owned by: | |
---|---|---|---|
Priority: | minor | Milestone: | someday |
Component: | unknown | Version: | 5.4.0 |
Keywords: | unicode | Cc: | |
Estimated difficulty: | | | |
Description
The unicode-char-sets module of the utf8 egg is missing several character sets. In particular, there is no set for characters with the Numeric property (making it impossible to implement a Unicode-aware 'char-numeric?' in CHICKEN), nor for any of the punctuation properties. The utf8-srfi-14 module does include char-set:digit and char-set:punctuation, but these are throwaway ASCII-only implementations (in a file that begins with "Unicode capable char-sets", no less!). The missing numeric and punctuation sets should be added.
Furthermore, the sets that unicode-char-sets does provide seem to be built on badly out-of-date data: the header comment in unicode-char-sets.scm says the tables were generated in 2007.
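To illustrate the gap, here is a minimal sketch in Python (not CHICKEN, and not code from the egg). The helper names are hypothetical, and it assumes "numeric" means Unicode general category Nd (decimal digits); other readings of the Numeric property would also admit characters such as fractions.

```python
import unicodedata

def ascii_digit(ch):
    """ASCII-only test, analogous to an ASCII-only char-set:digit."""
    return "0" <= ch <= "9"

def unicode_decimal_digit(ch):
    """Assumption: 'numeric' is taken to mean general category Nd."""
    return unicodedata.category(ch) == "Nd"

# DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, VULGAR FRACTION ONE HALF, and a letter
samples = ["7", "\u0663", "\u00bd", "x"]
results = [(ascii_digit(c), unicode_decimal_digit(c)) for c in samples]
```

Only the ASCII digit passes the first test, while the Arabic-Indic digit passes only the second. The fraction fails both under the Nd assumption, which shows why the exact property behind a numeric set needs to be pinned down.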
Attachments (4)
Change History (6)
comment:1 Changed 6 days ago by
Changed 25 hours ago by
Attachment: | generate-sets.sh added |
---|
Trivial driver script (run by custom-build).
comment:2 Changed 25 hours ago by
I've attached my set-module generation script & a patch for utf8.egg.
Changed 25 hours ago by
Attachment: | generate-sets.scm added |
---|
Script to generate Unicode char set modules from current UCD data.
Changed 24 hours ago by
Attachment: | unicode-char-sets.scm added |
---|
Catch-all unicode-char-sets module.
I'm working on an updated version of the utf8 egg which fetches & generates the Unicode character sets from the official tables. In the process, I've learned a lot more about the complexities of the Unicode property architecture.
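As a rough sketch of what generating a set from the official tables involves (Python for illustration only; this is not the attached generate-sets.scm, and `parse_ranges` is a hypothetical helper), UCD property files list a code point or a `first..last` range, then a semicolon and the property name, with `#` starting a comment. The sample lines below are written in the style of PropList.txt rather than fetched from it.

```python
# Sample lines in the style of the UCD file PropList.txt (embedded, not fetched).
SAMPLE = """\
# comment lines and blank lines are skipped

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
00A0          ; White_Space # Zs       NO-BREAK SPACE
"""

def parse_ranges(text, prop):
    """Return inclusive (low, high) code point ranges listed for `prop`."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        low = int(first, 16)
        high = int(last, 16) if last else low  # a single code point is a 1-element range
        ranges.append((low, high))
    return ranges

white_space = parse_ranges(SAMPLE, "White_Space")
```

Each resulting range could then be emitted as one range in the generated char-set definition.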
While I think char-set:numeric is still absolutely necessary & should be added, I now believe that the egg authors did the right thing in keeping char-set:digit, char-set:punctuation, & the rest of the old SRFI 14 sets ASCII-only. This is a bit confusing, however, since the names of some utf8-srfi-14 sets are very similar to those in unicode-char-sets, e.g. char-set:whitespace (utf8-srfi-14) & char-set:white-space (unicode-char-sets).
Since some Unicode char sets are quite large, I propose splitting (unicode-char-sets) into submodules, each containing one set. For example, char-set:arabic should be provided by (unicode-char-sets arabic). I'm implementing this in my new version of the egg.
More soon.