source: project/wiki/eggref/4/utf8 @ 31185

Last change on this file since 31185 was 31185, checked in by sjamaan, 6 years ago

Add an example on how to rename byte string operations (so we can close #1143)

File size: 12.0 KB
[[tags: egg]]

== utf8

[[toc:]]

=== Description

Unicode support

=== Author

Alex Shinn

=== Requirements

[[iset]]


=== Documentation

To make your code Unicode aware, just do the following:

  (require-extension utf8)

then all core, extra and regex string operations will be Unicode
aware. {{string-length}} will return the number of codepoints, not
the number of bytes, {{string-ref}} will index by codepoints and
return a char with an integer value up to 2^21, regular expressions
will match single codepoints rather than bytes and understand Unicode
character classes, etc.
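
As a small illustration of the codepoint semantics described above, here is
a minimal sketch. It assumes the source file is saved as UTF-8; the sample
string is only an illustration.

<enscript highlight="scheme">
(use utf8)

;; "λx.x" is 4 codepoints but 5 bytes in UTF-8 (λ takes two bytes).
(define s "λx.x")

(print (string-length s))                 ; 4 codepoints, not 5 bytes
(print (char->integer (string-ref s 0)))  ; 955, i.e. U+03BB
(print (substring s 0 1))                 ; the single character λ
</enscript>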

Strings are still native strings and may be passed to external
libraries (either Scheme or foreign) perfectly safely. Libraries that
do parsing invariably do so on ASCII character boundaries and are thus
guaranteed to be compatible. Libraries that reference strings by
index would need to be replaced with a UTF-8 version. Currently all
existing eggs are UTF-8 safe to my knowledge.

This extension does not load into the toplevel; it is composed of modules,
so it must be imported. Because its exported identifiers match those of
common Chicken modules, the conflicting identifiers must be excluded from
those imports. Use the example below to avoid such conflicts:

  (import
    (except scheme
      string-length string-ref string-set! make-string string substring
      string->list list->string string-fill! write-char read-char display)
    (except chicken
      reverse-list->string print print*)
    (except data-structures
      ->string conc string-chop string-split string-translate
      substring=? substring-ci=? substring-index substring-index-ci)
    (except extras
      read-string write-string read-token)
    (except regexp
      grep regexp string-substitute string-substitute* string-split-fields
      string-match string-match-positions string-match-offsets
      string-search string-search-positions string-search-offsets) )

''Note'' that not all Chicken string routines have a utf8 version yet:

; Unit data-structures : {{string-chomp}}, {{string-compare3}}, {{reverse-string-append}}
; Unit extras : {{pretty-print}}, {{printf}}, {{sprintf}}, {{fprintf}}, {{read-line}}, {{write-line}}, {{read-lines}}
; Unit irregex : (already utf8 aware, unless disabled)
; Unit utils : {{scan-input-lines}}

To use the Unicode-aware SRFI-13 and SRFI-14 with UTF-8 semantics:

  (require-extension utf8-srfi-13)
  (require-extension utf8-srfi-14)

The SRFI-14 module provides an alternative to the standard Chicken
SRFI-14. As a pure superset which handles arbitrary-sized characters
it should be usable as a drop-in replacement. The only aspect related
to UTF-8 is that {{string->char-set}} assumes the string is UTF-8 encoded.
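
A short sketch of how these two modules might be used together. It assumes
that {{utf8-srfi-13}} and {{utf8-srfi-14}} export the standard SRFI names
used here ({{string-take}}, {{string-index}}, {{char-set-contains?}},
{{char-set-size}}) and that the source file is UTF-8 encoded:

<enscript highlight="scheme">
(use utf8 utf8-srfi-13 utf8-srfi-14)

(define s "αβγ-abc")

;; SRFI-13 procedures count in codepoints rather than bytes.
(print (string-take s 3))      ; the three Greek letters
(print (string-index s #\-))   ; position as a codepoint index

;; string->char-set decodes its argument as UTF-8, so each Greek
;; letter becomes one member of the set rather than two bytes.
(define greek (string->char-set "αβγ"))
(print (char-set-contains? greek (string-ref s 0)))
(print (char-set-size greek))
</enscript>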

==== Unicode char-sets

The default SRFI-14 char-sets are defined using ASCII-only characters,
since this is both useful and lighter-weight. To obtain full Unicode
char-set definitions, use the unicode-char-sets unit:

  (require-extension unicode-char-sets)

[Note this is the only extension in this egg with a {{unicode-}}
prefix, because the char-set handling only depends on individual
characters and is independent of the character encoding used in
strings.]

The following char-sets are provided based on the Unicode properties:

  char-set:alphabetic
  char-set:arabic
  char-set:armenian
  char-set:ascii-hex-digit
  char-set:bengali
  char-set:bidi-control
  char-set:bopomofo
  char-set:braille
  char-set:buhid
  char-set:canadian-aboriginal
  char-set:cherokee
  char-set:common
  char-set:cypriot
  char-set:cyrillic
  char-set:dash
  char-set:default-ignorable-code-point
  char-set:deprecated
  char-set:deseret
  char-set:devanagari
  char-set:diacritic
  char-set:ethiopic
  char-set:extender
  char-set:georgian
  char-set:gothic
  char-set:grapheme-base
  char-set:grapheme-extend
  char-set:grapheme-link
  char-set:greek
  char-set:gujarati
  char-set:gurmukhi
  char-set:han
  char-set:hangul
  char-set:hanunoo
  char-set:hebrew
  char-set:hex-digit
  char-set:hiragana
  char-set:hyphen
  char-set:id-continue
  char-set:id-start
  char-set:ideographic
  char-set:ids-binary-operator
  char-set:ids-trinary-operator
  char-set:inherited
  char-set:join-control
  char-set:kannada
  char-set:katakana
  char-set:katakana-or-hiragana
  char-set:khmer
  char-set:lao
  char-set:latin
  char-set:limbu
  char-set:linear-b
  char-set:logical-order-exception
  char-set:lowercase
  char-set:malayalam
  char-set:math
  char-set:mongolian
  char-set:myanmar
  char-set:noncharacter-code-point
  char-set:ogham
  char-set:old-italic
  char-set:oriya
  char-set:osmanya
  char-set:quotation-mark
  char-set:radical
  char-set:runic
  char-set:shavian
  char-set:sinhala
  char-set:soft-dotted
  char-set:sterm
  char-set:syriac
  char-set:tagalog
  char-set:tagbanwa
  char-set:tai-le
  char-set:tamil
  char-set:telugu
  char-set:terminal-punctuation
  char-set:thaana
  char-set:thai
  char-set:tibetan
  char-set:ugaritic
  char-set:unified-ideograph
  char-set:uppercase
  char-set:variation-selector
  char-set:white-space
  char-set:xid-continue
  char-set:xid-start
  char-set:yi
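
Membership in the char-sets listed above can be tested like this. This is
only a sketch: it assumes {{char-set-contains?}} is taken from
{{utf8-srfi-14}} (not the old srfi-14) and that the source file is UTF-8
encoded.

<enscript highlight="scheme">
(use utf8 utf8-srfi-14 unicode-char-sets)

;; U+03BB GREEK SMALL LETTER LAMDA, fetched by codepoint index.
(define lam (string-ref "λ" 0))

(print (char-set-contains? char-set:greek lam))
(print (char-set-contains? char-set:cyrillic lam))
(print (char-set-contains? char-set:lowercase lam))
</enscript>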

==== Unicode case-mappings

The SRFI-13 case-mapping procedures ({{string-upcase}}, etc.) are
defined using only ASCII case-mappings, since this is both useful and
lighter-weight. To get fully Unicode-aware case-mappings, use

  (require-extension utf8-case-map)

which provides the {{utf8-string-upcase}}, {{utf8-string-downcase}},
and {{utf8-string-titlecase}} procedures. These take a first argument
of either a string or port, and an optional second argument of locale
(as a string), returning the appropriate locale-aware case-mapped string.
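
A minimal sketch of both calling conventions, with the sample strings chosen
only for illustration. The optional locale argument is left out because the
expected format of the locale string is not specified here.

<enscript highlight="scheme">
(use utf8 utf8-case-map)

;; String argument, default locale.
(print (utf8-string-upcase "straße"))
(print (utf8-string-titlecase "élan vital"))

;; Port argument: the text to map is read from the port instead.
(print (utf8-string-downcase (open-input-string "ΑΒΓ")))
</enscript>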

==== Byte-strings

Sometimes you may need access to the original string primitives so you
can directly access bytes, for example if you are implementing your own
regex library or text buffer and want optimal performance. In
these cases you can simply import and rename or prefix the string
procedures from the {{scheme}} module, like so:

<enscript highlight="scheme">
(use utf8)
;; Keep the byte-level versions from the scheme module under new names.
(import (rename (only scheme string-ref string-set!)
                (string-ref byte-ref)
                (string-set! byte-set!)))
</enscript>

Now the original string operations, which operate at the byte level,
are available as {{byte-ref}} and {{byte-set!}}.
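
The prefix variant works along the same lines. In this sketch the
{{byte-}} prefix (and therefore the names {{byte-string-length}} and
{{byte-string-ref}}) is an arbitrary choice for illustration, and the
source file is assumed to be UTF-8 encoded:

<enscript highlight="scheme">
(use utf8)
(import (prefix (only scheme string-length string-ref) byte-))

(define s "λx")

(print (string-length s))        ; 2 codepoints
(print (byte-string-length s))   ; 3 bytes
;; The first byte of the two-byte encoding of λ (#xCE #xBB).
(print (char->integer (byte-string-ref s 0)))  ; 206
</enscript>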

==== Low-level API

Direct manipulation of the utf8 encoding is factored away in the
{{utf8-lolevel}} unit. This includes an abstract string-pointer API,
and an analogous string-pointer implementation for ASCII strings in
the {{string-pointer}} unit. However, the API is not fixed, so use these
at your own risk.

==== Limitations

{{peek-char}} currently does not have Unicode semantics (i.e. it peeks
only a single byte) to avoid problems with port buffering.

Char-sets are not interchangeable between the existing srfi-14 code
and the Unicode code (i.e. do not pass a Unicode char-set to an external
library that directly uses the old srfi-14).

Attempting to mutate a literal string will result in an error if the
mutated string does not occupy the same number of bytes as the original.
This is standards compliant, since the programmer is not supposed to
attempt to mutate literal values, but it may be a little confusing
since the error occurs inconsistently.
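
One way to sidestep the literal-string restriction is presumably to mutate
a fresh copy instead of the literal itself. A minimal sketch, assuming
{{string-copy}} returns an ordinary mutable string and that the source file
is UTF-8 encoded:

<enscript highlight="scheme">
(use utf8)

;; Copy the literal first so the mutation targets a fresh string.
(define s (string-copy "abc"))

;; Replacing the 1-byte "b" with the 2-byte λ changes the byte size,
;; which is the case that fails on a literal.
(string-set! s 1 (string-ref "λ" 0))
(print s)
</enscript>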

==== Performance

{{string-length}}, {{string-ref}} and {{string-set!}} are all O(n)
operations as opposed to the usual O(1), since UTF-8 is a variable
width encoding. Their use is discouraged; it is much
cleaner to use the high-level SRFI-13 procedures and string ports.
For examples of common idioms written without these procedures, look
at any string-based code in Gauche.
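
For instance, a scan that would otherwise call {{string-ref}} once per
index (quadratic overall, given the O(n) cost above) can read from a string
port in a single pass. This is a sketch only: {{count-chars}} is a
hypothetical helper, {{with-input-from-string}} is assumed to come from the
{{ports}} unit, and the source file is assumed to be UTF-8 encoded.

<enscript highlight="scheme">
(use utf8 ports)

;; Count the characters of s that satisfy pred, reading each
;; codepoint once from a string port instead of indexing.
(define (count-chars pred s)
  (with-input-from-string s
    (lambda ()
      (let loop ((n 0))
        (let ((c (read-char)))
          (cond ((eof-object? c) n)
                ((pred c) (loop (+ n 1)))
                (else (loop n))))))))

(print (count-chars (lambda (c) (char=? c #\.)) "λx.x"))            ; 1
(print (count-chars (lambda (c) (> (char->integer c) 127)) "λx.x")) ; 1, λ is read as one char
</enscript>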

Furthermore, {{string-set!}} and other procedures that modify strings
in place may invoke gc if the mutated result does not fit within the
same UTF-8 encoding size as the original string. If only mutating
7-bit ASCII strings (or only mutating within fixed encoding sizes such
as Cyrillic->Cyrillic) then no gc will occur.

{{string?}}, {{string=?}}, {{string-append}}, all R5RS string
comparisons, and {{read-line}} are unmodified.

Regular expression matching will be just as fast except in the case of
Unicode character classes (which were not possible before anyway).

All other procedures incur zero to minor overhead, but keep the same
asymptotic performance.

==== Discussion

There are two ways to add Unicode string support to an existing
language: redefine the strings themselves (i.e. add a new string
type), or redefine the operations on the strings. The former causes a
schism in your string libraries, dividing them between Unicode-aware
and not, either doubling your library implementations or limiting them
to one type or the other. You can't freely pass strings to other
libraries without keeping track of their types and converting when
needed. It becomes slow and unwieldy. C and Perl are the only
languages I know of that seriously tried this. In Perl the modules
which worked with Unicode strings were minimal, frequent type
conversions were needed, a general mess ensued, and Perl very quickly
switched to the latter approach. In C as well, the libraries
supporting wchar are still minimal, while most libraries still only
support char.

UTF-8 is ideal for the in-place sort of extension because it is
backwards compatible with ASCII. Any ASCII (7-bit) byte found within
a UTF-8 string is guaranteed to be that character, not part of a
multibyte character, so parsing libraries that work on ASCII
characters work unmodified. This includes most existing text formats
and network protocols. The EUC (Extended Unix Code) encodings also
have this feature, so a similar module could be implemented allowing
users to {{(require 'euc-jp)}}, for example, and work in Japanese EUC
rather than Unicode. Other encodings such as Shift_JIS satisfy the
requirement that an ASCII string has the same meaning in the encoding,
but multibyte characters in the encoding may include ASCII bytes,
breaking the rule we need for safe ASCII parsing. A few encodings
like UTF-16 and UTF-32 are completely incompatible. UTF-16 is
used these days primarily by Java, a victim of the unfortunate
fact that UTF-16 was at first fixed width but, with the
advent of surrogate pairs, no longer is. Note that even without this
module you can write source code in Chicken in any ASCII-compatible
encoding like ISO-8859-* or UTF-8 and define symbols with that encoding
(letting you replace lambda with syntax for a real Greek lambda, for example).
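
The ASCII-compatibility property described above can be observed with the
plain byte-level procedures. In this sketch the {{utf8}} extension is
deliberately not loaded, and the source file is assumed to be UTF-8
encoded:

<enscript highlight="scheme">
;; No (use utf8) here: these are the ordinary byte-level procedures.
(use data-structures)

;; Every byte of a multibyte character is >= 128, so none of them
;; can be mistaken for an ASCII character.
(define bytes (map char->integer (string->list "λ")))
(print bytes)                        ; (206 187)

;; Splitting on an ASCII comma is therefore safe even byte by byte.
(define fields (string-split "α,β,γ" ","))
(print (length fields))              ; 3
(print (map string-length fields))   ; (2 2 2), the byte length of each field
</enscript>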

Other languages that use UTF-8 include Perl, Python and Tcl. XML and
a growing number of network standards use UTF-8 by
default, and major databases all support UTF-8. Libraries with UTF-8
support include Gtk, SDL, and freetype.

=== Changelog

; 3.3.0 ;
; 3.2.0 ;
; 3.1.0 ;
; 3.0.0 ; Hello

=== License

Copyright (c) 2004-2008, Alex Shinn
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met:

  Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  Neither the name of the author nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.