[[tags: egg]]

== utf8

[[toc:]]

=== Description

Unicode support

=== Author

Alex Shinn

=== Requirements

[[iset]]


=== Documentation

To make your code Unicode aware, just do the following:

  (require-extension utf8)

After that, all core, extras and regex string operations are Unicode
aware: {{string-length}} returns the number of codepoints rather than
the number of bytes, {{string-ref}} indexes by codepoint and returns a
char with an integer value up to 2^21, regular expressions match single
codepoints rather than bytes and understand Unicode character classes,
and so on.
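
As a rough illustration of the behaviour described above, the following sketch assumes CHICKEN 4 with the utf8 egg installed and a source file saved as UTF-8. The variable name {{s}} is purely illustrative.

```scheme
;; Sketch only: assumes CHICKEN 4 with the utf8 egg installed
;; and that this file is saved in UTF-8.
(require-extension utf8)

;; One 1-byte, one 2-byte and one 3-byte character (6 bytes in total).
(define s "aλ本")

;; Counts codepoints rather than bytes.
(display (string-length s))
(newline)

;; Indexes by codepoint; the resulting char can have a value above 255.
(display (char->integer (string-ref s 2)))
(newline)
```

Under these assumptions the two lines printed should be 3 and 26412 (the code point U+672C).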

Strings are still native strings and may be passed to external
libraries (either Scheme or foreign) perfectly safely. Libraries that
do parsing invariably do so on ASCII character boundaries and are thus
guaranteed to be compatible. Libraries that reference strings by
index would need to be replaced with a UTF-8 aware version. Currently all
existing eggs are UTF-8 safe to my knowledge.

This extension is composed of modules and does not load its bindings into
the toplevel, so it must be imported. Because its exported identifiers have
the same names as those of common Chicken imports, the conflicting
identifiers must be excluded from those imports. The following example shows
the exclusions:

  (import
    (except scheme
      string-length string-ref string-set! make-string string substring
      string->list list->string string-fill! write-char read-char display)
    (except chicken
      reverse-list->string print print*)
    (except data-structures
      ->string conc string-chop string-split string-translate
      substring=? substring-ci=? substring-index substring-index-ci)
    (except extras
      read-string write-string read-token)
    (except regexp
      grep regexp string-substitute string-substitute* string-split-fields
      string-match string-match-positions string-match-offsets
      string-search string-search-positions string-search-offsets) )

''Note'' that not all Chicken string routines have a utf8 version yet:

; Unit data-structures : {{string-chomp}}, {{string-compare3}}, {{reverse-string-append}}
; Unit extras : {{pretty-print}}, {{printf}}, {{sprintf}}, {{fprintf}}, {{read-line}}, {{write-line}}, {{read-lines}}
; Unit irregex : (already utf8 aware, unless disabled)
; Unit utils : {{scan-input-lines}}

To use Unicode-aware versions of SRFI-13 and SRFI-14 with UTF-8 semantics:

  (require-extension utf8-srfi-13)
  (require-extension utf8-srfi-14)

The SRFI-14 module provides an alternative to the standard Chicken
SRFI-14. As a pure superset which handles arbitrary-sized characters
it should be usable as a drop-in replacement. The only UTF-8 specific
aspect is that {{string->char-set}} assumes its string argument is
UTF-8 encoded.

==== Unicode char-sets

The default SRFI-14 char-sets are defined using ASCII-only characters,
since this is both useful and lighter-weight. To obtain full Unicode
char-set definitions, use the unicode-char-sets unit:

  (require-extension unicode-char-sets)

[Note that this is the only extension in this egg with a {{unicode-}}
prefix, because the char-set handling only depends on individual
characters and is independent of the character encoding used in
strings.]

The following char-sets are provided based on the Unicode properties:

  char-set:alphabetic
  char-set:arabic
  char-set:armenian
  char-set:ascii-hex-digit
  char-set:bengali
  char-set:bidi-control
  char-set:bopomofo
  char-set:braille
  char-set:buhid
  char-set:canadian-aboriginal
  char-set:cherokee
  char-set:common
  char-set:cypriot
  char-set:cyrillic
  char-set:dash
  char-set:default-ignorable-code-point
  char-set:deprecated
  char-set:deseret
  char-set:devanagari
  char-set:diacritic
  char-set:ethiopic
  char-set:extender
  char-set:georgian
  char-set:gothic
  char-set:grapheme-base
  char-set:grapheme-extend
  char-set:grapheme-link
  char-set:greek
  char-set:gujarati
  char-set:gurmukhi
  char-set:han
  char-set:hangul
  char-set:hanunoo
  char-set:hebrew
  char-set:hex-digit
  char-set:hiragana
  char-set:hyphen
  char-set:id-continue
  char-set:id-start
  char-set:ideographic
  char-set:ids-binary-operator
  char-set:ids-trinary-operator
  char-set:inherited
  char-set:join-control
  char-set:kannada
  char-set:katakana
  char-set:katakana-or-hiragana
  char-set:khmer
  char-set:lao
  char-set:latin
  char-set:limbu
  char-set:linear-b
  char-set:logical-order-exception
  char-set:lowercase
  char-set:malayalam
  char-set:math
  char-set:mongolian
  char-set:myanmar
  char-set:noncharacter-code-point
  char-set:ogham
  char-set:old-italic
  char-set:oriya
  char-set:osmanya
  char-set:quotation-mark
  char-set:radical
  char-set:runic
  char-set:shavian
  char-set:sinhala
  char-set:soft-dotted
  char-set:sterm
  char-set:syriac
  char-set:tagalog
  char-set:tagbanwa
  char-set:tai-le
  char-set:tamil
  char-set:telugu
  char-set:terminal-punctuation
  char-set:thaana
  char-set:thai
  char-set:tibetan
  char-set:ugaritic
  char-set:unified-ideograph
  char-set:uppercase
  char-set:variation-selector
  char-set:white-space
  char-set:xid-continue
  char-set:xid-start
  char-set:yi
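
A membership test against one of these char-sets might look like the sketch below. It assumes CHICKEN 4 with the utf8 egg installed, and it assumes that {{char-set-contains?}} from {{utf8-srfi-14}} is the right procedure to pair with these sets. The names {{small-lambda}} and {{latin-a}} are illustrative only.

```scheme
;; Sketch only: assumes CHICKEN 4 with the utf8 egg installed.
(require-extension utf8-srfi-14 unicode-char-sets)

;; Build the test characters from code points so the example does not
;; depend on the encoding of this source file.
(define small-lambda (integer->char #x3BB)) ; GREEK SMALL LETTER LAMDA
(define latin-a (integer->char #x61))       ; LATIN SMALL LETTER A

;; Expected to be true: U+03BB belongs to the Greek script.
(display (char-set-contains? char-set:greek small-lambda))
(newline)

;; Expected to be false: U+0061 belongs to the Latin script.
(display (char-set-contains? char-set:greek latin-a))
(newline)
```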

==== Unicode case-mappings

The SRFI-13 case-mapping procedures ({{string-upcase}}, etc.) are
defined using only ASCII case-mappings, since this is both useful and
lighter-weight. To get full Unicode-aware case-mappings, use

  (require-extension utf8-case-map)

which provides the {{utf8-string-upcase}}, {{utf8-string-downcase}},
and {{utf8-string-titlecase}} procedures. These take a first argument
of either a string or port, and an optional second argument of locale
(as a string), returning the appropriate locale-aware case-mapped string.

==== Byte-strings

Sometimes you may need access to the original string primitives so you
can directly access bytes, for example when implementing your own
regex library or text buffer and wanting optimal performance. For
these cases we have renamed the original primitives by replacing
{{string}} with {{byte-string}}. Thus {{byte-string-length}} is the
length in bytes, not characters, of a string (the equivalent of
Gauche's {{string-size}}). {{byte-string-set!}} can corrupt the UTF-8
encoding and should be used sparingly if at all.

==== Low-level API

Direct manipulation of the utf8 encoding is factored away in the
{{utf8-lolevel}} unit. This includes an abstract string-pointer API,
and an analogous string-pointer implementation for ASCII strings in
the {{string-pointer}} unit. The API is not fixed, however, so use
these at your own risk.

==== Limitations

{{peek-char}} currently does not have Unicode semantics (i.e. it peeks
only a single byte) to avoid problems with port buffering.

Char-sets are not interchangeable between the existing srfi-14 code
and Unicode code (i.e. do not pass a Unicode char-set to an external
library that directly uses the old srfi-14).

Attempting to mutate literal strings will result in an error if the
mutated string does not occupy the same number of bytes as the original.
This is standards compliant, since the programmer is not supposed to
attempt to mutate literal values, but it may be a little confusing
since the error occurs only in some cases.

==== Performance

{{string-length}}, {{string-ref}} and {{string-set!}} are all O(n)
operations as opposed to the usual O(1) since UTF-8 is a variable
width encoding. Use of these should be discouraged: it is much
cleaner to use the high-level SRFI-13 procedures and string ports.
For examples of how to write common idioms without these procedures,
look at any string-based code in Gauche.
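
To see why these operations are linear, here is a minimal sketch that works on a plain list of UTF-8 byte values. It is not the egg's actual implementation, and the procedure names {{continuation-byte?}}, {{utf8-codepoint-count}} and {{utf8-byte-offset}} are hypothetical.

```scheme
;; UTF-8 continuation bytes have the bit pattern 10xxxxxx (#x80..#xBF).
(define (continuation-byte? b)
  (and (>= b #x80) (< b #xC0)))

;; Counting codepoints means visiting every byte and counting the
;; bytes that start a character, hence O(n).
(define (utf8-codepoint-count bytes)
  (let loop ((bs bytes) (n 0))
    (cond ((null? bs) n)
          ((continuation-byte? (car bs)) (loop (cdr bs) n))
          (else (loop (cdr bs) (+ n 1))))))

;; Finding the byte offset of the k-th codepoint (0-based) also
;; requires scanning from the start; returns #f if k is out of range.
(define (utf8-byte-offset bytes k)
  (let loop ((bs bytes) (i 0) (n 0))
    (cond ((null? bs) #f)
          ((continuation-byte? (car bs)) (loop (cdr bs) (+ i 1) n))
          ((= n k) i)
          (else (loop (cdr bs) (+ i 1) (+ n 1))))))

;; The bytes of "a", U+03BB and U+672C: 1 + 2 + 3 bytes.
(define sample '(#x61 #xCE #xBB #xE6 #x9C #xAC))

(display (utf8-codepoint-count sample)) ; 3 codepoints in 6 bytes
(newline)
(display (utf8-byte-offset sample 2))   ; third codepoint starts at byte 3
(newline)
```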

Furthermore, {{string-set!}} and other procedures that modify strings
in place may invoke gc if the mutated result does not fit within the
same UTF-8 encoding size as the original string. If you only mutate
7-bit ASCII strings (or only mutate within fixed encoding sizes such
as Cyrillic->Cyrillic) then no gc will occur.
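
The size condition can be made concrete with a small sketch that computes how many bytes a code point occupies in UTF-8. The procedure names {{utf8-width}} and {{same-utf8-width?}} are hypothetical and are not part of the egg.

```scheme
;; Number of bytes needed to encode a code point in UTF-8.
(define (utf8-width cp)
  (cond ((< cp #x80) 1)
        ((< cp #x800) 2)
        ((< cp #x10000) 3)
        (else 4)))

;; A replacement keeps the string's byte length only when the old and
;; new characters have the same encoded width.
(define (same-utf8-width? old-cp new-cp)
  (= (utf8-width old-cp) (utf8-width new-cp)))

(display (same-utf8-width? #x61 #x62))   ; ASCII -> ASCII: same size
(newline)
(display (same-utf8-width? #x430 #x44F)) ; Cyrillic -> Cyrillic: same size
(newline)
(display (same-utf8-width? #x61 #x44F))  ; ASCII -> Cyrillic: size grows
(newline)
```

Only the last case changes the byte length of the string, which is the situation in which the text above says gc may be invoked.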

{{string?}}, {{string=?}}, {{string-append}}, all R5RS string
comparisons, and {{read-line}} are unmodified.

Regular expression matching will be just as fast except in the case of
Unicode character classes (which were not possible before anyway).

All other procedures incur zero to minor overhead, but keep the same
asymptotic performance.

==== Discussion

There are two ways to add Unicode string support to an existing
language: redefine the strings themselves (i.e. add a new string
type), or redefine the operations on the strings. The former causes a
schism in your string libraries, dividing them between Unicode-aware
and not, either doubling your library implementations or limiting them
to one type or the other. You can't freely pass strings to other
libraries without keeping track of their types and converting when
needed. It becomes slow and unwieldy. C and Perl are the only
languages I know of that seriously tried this. In Perl the modules
which worked with Unicode strings were minimal, frequent type
conversions were needed, a general mess ensued, and Perl very quickly
switched to the latter approach. In C as well, the libraries
supporting wchar are still minimal, while most libraries still only
support char.

UTF-8 is ideal for the in-place sort of extension because it is
backwards compatible with ASCII. Any ASCII (7-bit) byte found within
a UTF-8 string is guaranteed to be that character, not part of a
multibyte character, so parsing libraries that work on ASCII
characters work unmodified. This includes most existing text formats
and network protocols. The EUC (Extended Unix Code) encodings also
have this feature, so a similar module could be implemented allowing
users to {{(require 'euc-jp)}}, for example, and work in Japanese EUC
rather than Unicode. Other encodings such as Shift_JIS satisfy the
requirement that an ASCII string has the same meaning in the encoding,
but multibyte characters in the encoding may include ASCII bytes,
breaking the rule we need for safe ASCII parsing. A few encodings
like UTF-16 and UTF-32 are completely incompatible. UTF-16 is
used primarily by Java these days, a victim of the unfortunate
fact that UTF-16 was at first fixed width but no longer is since the
advent of surrogate pairs. Note that even without this module you can
write source code in Chicken in any ASCII compatible encoding like
ISO-8859-* or UTF-8 and define symbols with that encoding (letting you
replace lambda with syntax for a real Greek lambda, for example).
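
The ASCII-compatibility property described above can be checked with a small encoder sketch. It works on integers and lists only, is not taken from the egg, and the names {{utf8-encode}} and {{all-non-ascii?}} are hypothetical.

```scheme
;; Encode one code point as a list of UTF-8 byte values.
(define (utf8-encode cp)
  (cond ((< cp #x80) (list cp))
        ((< cp #x800)
         (list (+ #xC0 (quotient cp 64))
               (+ #x80 (modulo cp 64))))
        ((< cp #x10000)
         (list (+ #xE0 (quotient cp 4096))
               (+ #x80 (modulo (quotient cp 64) 64))
               (+ #x80 (modulo cp 64))))
        (else
         (list (+ #xF0 (quotient cp 262144))
               (+ #x80 (modulo (quotient cp 4096) 64))
               (+ #x80 (modulo (quotient cp 64) 64))
               (+ #x80 (modulo cp 64))))))

;; True when no byte in the list falls in the 7-bit ASCII range.
(define (all-non-ascii? bytes)
  (or (null? bytes)
      (and (>= (car bytes) #x80)
           (all-non-ascii? (cdr bytes)))))

;; ASCII characters encode as themselves ...
(display (utf8-encode #x41))
(newline)

;; ... and every byte of a multibyte character is #x80 or above, so an
;; ASCII delimiter can never appear inside one.
(display (all-non-ascii? (utf8-encode #x3BB)))
(newline)
(display (all-non-ascii? (utf8-encode #x672C)))
(newline)
```

Every lead byte is at least #xC0 and every continuation byte is at least #x80, which is why a parser that splits on ASCII bytes cannot cut a multibyte character in half.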

Other languages that use UTF-8 include Perl, Python and Tcl. XML and
more and more network standards are using UTF-8 by
default, and major databases all support UTF-8. Libraries with UTF-8
support include Gtk, SDL, and freetype.

=== Changelog

; 3.3.0 ;
; 3.2.0 ;
; 3.1.0 ;
; 3.0.0 ; Hello

=== License

Copyright (c) 2004-2008, Alex Shinn
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met:

  Redistributions of source code must retain the above copyright notice, this list of conditions and the following
  disclaimer.
  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
  disclaimer in the documentation and/or other materials provided with the distribution.
  Neither the name of the author nor the names of its contributors may be used to endorse or promote
  products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.