source: project/utf8/utf8.html @ 7850

Last change on this file since 7850 was 7850, checked in by Kon Lovett, 12 years ago

Split into sep installed extn. Stops the "redefined variable" msgs from utf8-srfi-13.

<html><head><title>utf8</title>

<style type="text/css">
  <!--
      CODE {
             color: #666666;
           }
      EM {
           font-weight: bold;
           font-style: normal;
         }
      DT.function {
                    background: #f5f5f5;
                    color: black;
                    padding: 0.1em;
                    border: 1px solid #bbbaaf;
                    font-family: monospace;
                  }
      PRE {
        background: #efeee0;
        padding: 0.1em;
        border: 1px solid #bbbaaf;
      }
    -->
</style></head>
  <body>

<center><img src="egg.jpg"></center>
<center><a href="index.html">back</a></center>

<h2>utf8</h2>

<h3>Description:</h3>

Unicode support

<h3>Author:</h3>
Alex Shinn

<h3>Version:</h3>
<ul>
<li>1.14 Split modules into separate extensions. [Kon Lovett]</li>
<li>1.13 Removed read/write-byte definition. [Kon Lovett]</li>
<li>1.12 </li>
<li>1.11 Fixed misspelled variables, '\' in string, added imports/exports [Kon Lovett]</li>
<li>1.1
Fixed platform-issue with case-map [Thanks to Kon Lovett and Alex Shinn]; split
SRFI-13 and SRFI-14 parts into separate extensions</li>
<li>1.0</li>
</ul>

<h3>Usage:</h3>
<pre>(require-extension utf8)
</pre>

<h3>Requires:</h3>
<a href="iset.html"><code>iset</code></a> and <a href="syntax-case.html">syntax-case</a>

<h3>Download:</h3>
<a href="utf8.egg">utf8.egg</a>

<h3>Documentation:</h3>

<p>USAGE

<p>To make your code Unicode-aware, do the following:

<pre>
  (require-extension utf8)
  (module mymodule ()
    (import utf8)

    ... ; unicode-aware code

    )
</pre>

<p>Then all core, extra, and regex string operations become
Unicode-aware: <code>string-length</code> returns the number of codepoints
rather than the number of bytes; <code>string-ref</code> indexes by codepoint and
returns a char with an integer value up to 2^21; and regular expressions
match single codepoints rather than bytes and understand Unicode
character classes.
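<p>As a language-neutral illustration of the difference between counting
codepoints and counting bytes, here is a small sketch in Python (not Chicken
Scheme, and not this egg's API).  Python's <code>str</code> and <code>bytes</code> types stand in
for Unicode-aware and byte-oriented string operations respectively; the sample
text and variable names are made up for the example.

```python
# Illustration only: str indexes by codepoint, bytes indexes by byte.
text = "na\u00efve \u65e5\u672c\u8a9e"   # 9 codepoints, some of them multibyte
encoded = text.encode("utf-8")

codepoint_count = len(text)      # what a Unicode-aware string-length reports
byte_count = len(encoded)        # what a byte-oriented string-length reports

third_char = text[2]             # indexing by codepoint yields a whole character
third_byte = encoded[2]          # indexing by byte lands on the lead byte 0xC3
```

<p>The two counts differ as soon as the text contains any non-ASCII character,
which is why code that indexes strings by position needs to know which of the
two semantics it is getting.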

<p>Strings are still native strings and may be passed safely to external
libraries (either Scheme or foreign).  Libraries that parse text
invariably do so on ASCII character boundaries and are thus guaranteed
to be compatible.  Libraries that reference strings by index would need
to be modified to use UTF-8-aware indexing.  To my knowledge, all existing
eggs are currently UTF-8 safe.

<p>Alternatively, you may import utf8 at the top level:

<pre>
  ; require modules using byte-semantics
  ...
  (require-extension utf8)
  (import utf8)
  ...
  ; require modules using utf8-semantics
  ...
  ; unicode-aware code
</pre>

<p>By importing directly into the top level, any subsequently loaded
code will also use Unicode-aware semantics, even if it was not
written with Unicode in mind.  This is more powerful but slightly
less safe, since third-party units may make assumptions about
character ranges or string size.

<p>To use SRFI-13 and SRFI-14 with UTF-8 semantics:

<pre>
  (require-extension utf8-srfi-13)
  (module ()
    (import utf8-srfi-13)

    ... ; unicode-aware SRFI-13
    )

  (require-extension utf8-srfi-14)
  (module ()
    (import utf8-srfi-14)

     ... ; unicode-capable SRFI-14
    )
</pre>

<p>The SRFI-14 module provides an alternative to the standard Chicken SRFI-14.
As a pure superset that handles arbitrary-sized characters, it should
be usable as a drop-in replacement.  The only UTF-8-specific aspect
is that STRING->CHAR-SET assumes the string is UTF-8 encoded.

<p>UNICODE CHAR-SETS

<p>The default SRFI-14 char-sets are defined using ASCII-only
characters, since this is both useful and lighter-weight.  To obtain
full Unicode char-set definitions, use the char-set unit:

<pre>
  (require-extension char-set)
</pre>

<p>The following char-sets are provided based on the Unicode properties:

<pre>
  char-set:alphabetic
  char-set:arabic
  char-set:armenian
  char-set:ascii-hex-digit
  char-set:bengali
  char-set:bidi-control
  char-set:bopomofo
  char-set:braille
  char-set:buhid
  char-set:canadian-aboriginal
  char-set:cherokee
  char-set:common
  char-set:cypriot
  char-set:cyrillic
  char-set:dash
  char-set:default-ignorable-code-point
  char-set:deprecated
  char-set:deseret
  char-set:devanagari
  char-set:diacritic
  char-set:ethiopic
  char-set:extender
  char-set:georgian
  char-set:gothic
  char-set:grapheme-base
  char-set:grapheme-extend
  char-set:grapheme-link
  char-set:greek
  char-set:gujarati
  char-set:gurmukhi
  char-set:han
  char-set:hangul
  char-set:hanunoo
  char-set:hebrew
  char-set:hex-digit
  char-set:hiragana
  char-set:hyphen
  char-set:id-continue
  char-set:id-start
  char-set:ideographic
  char-set:ids-binary-operator
  char-set:ids-trinary-operator
  char-set:inherited
  char-set:join-control
  char-set:kannada
  char-set:katakana
  char-set:katakana-or-hiragana
  char-set:khmer
  char-set:lao
  char-set:latin
  char-set:limbu
  char-set:linear-b
  char-set:logical-order-exception
  char-set:lowercase
  char-set:malayalam
  char-set:math
  char-set:mongolian
  char-set:myanmar
  char-set:noncharacter-code-point
  char-set:ogham
  char-set:old-italic
  char-set:oriya
  char-set:osmanya
  char-set:quotation-mark
  char-set:radical
  char-set:runic
  char-set:shavian
  char-set:sinhala
  char-set:soft-dotted
  char-set:sterm
  char-set:syriac
  char-set:tagalog
  char-set:tagbanwa
  char-set:tai-le
  char-set:tamil
  char-set:telugu
  char-set:terminal-punctuation
  char-set:thaana
  char-set:thai
  char-set:tibetan
  char-set:ugaritic
  char-set:unified-ideograph
  char-set:uppercase
  char-set:variation-selector
  char-set:white-space
  char-set:xid-continue
  char-set:xid-start
  char-set:yi
</pre>

<p>UNICODE CASE-MAPPINGS

<p>The SRFI-13 case-mapping procedures (<code>string-upcase</code>, etc.) are defined
using only ASCII case-mappings, since this is both useful and
lighter-weight.  To get full Unicode-aware case-mappings, use

<pre>
  (require-extension case-map)
</pre>

<p>which provides the <code>upcase</code>, <code>downcase</code>, and <code>titlecase</code> procedures.  These
take a first argument of either a string or port, and an optional
second argument of locale (as a string), returning the appropriate
locale-aware case-mapped string.
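<p>To make the difference between ASCII-only and full Unicode case-mapping
concrete, here is a rough sketch in Python rather than Scheme.  The helper
<code>ascii_upcase</code> is hypothetical and only mimics an ASCII-only mapping; Python's
built-in <code>upper</code> stands in for a full Unicode mapping, although unlike the
procedures described above it takes no locale argument.

```python
import string


def ascii_upcase(text):
    """Uppercase only the letters a-z and leave every other character alone."""
    return "".join(c.upper() if c in string.ascii_lowercase else c for c in text)


word = "\u00e9lan"                  # "elan" with an acute accent on the first letter
ascii_only = ascii_upcase(word)     # the accented letter stays lowercase
full = word.upper()                 # Unicode-aware: the accented letter is mapped too

# Full case mapping can also change the length of a string:
street = "stra\u00dfe".upper()      # the sharp s expands to "SS"
```

<p>The last line shows why full case-mapping works on whole strings rather than
character by character: a single character can map to more than one.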


<p>BYTE-STRINGS

<p>Sometimes you may need access to the original string primitives so
you can directly access bytes, for example when implementing your
own regex library or text buffer and wanting optimal performance.  For
these cases the original primitives are available under new names, with
<code>string</code> replaced by <code>byte-string</code>.  Thus <code>byte-string-length</code> is the length
of a string in bytes, not characters (the equivalent of Gauche's
<code>string-size</code>).  <code>byte-string-set!</code> can corrupt the UTF-8 encoding and
should be used sparingly, if at all.
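<p>The kind of corruption meant here can be sketched in Python (again an
illustration, not this egg's API): a <code>bytearray</code> plays the role of a string
accessed through byte-level primitives, and overwriting a single byte of a
multibyte character leaves a sequence that no longer decodes as UTF-8.

```python
# "ete" with acute accents on both e's: bytes c3 a9 74 c3 a9
data = bytearray("\u00e9t\u00e9".encode("utf-8"))
data[0] = ord("e")          # overwrite only the lead byte of the first character

try:
    data.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:  # the orphaned continuation byte 0xA9 is rejected
    still_valid = False

# Replacing every byte of the character keeps the encoding intact.
safe = bytearray("\u00e9t\u00e9".encode("utf-8"))
safe[0:2] = b"e"            # both bytes of the first character are replaced
repaired = safe.decode("utf-8")
```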


<p>LOW LEVEL API

<p>Direct manipulation of the UTF-8 encoding is factored out into the
utf8-lolevel unit.  This includes an abstract string-pointer API; an
analogous string-pointer implementation for ASCII strings lives in the
string-pointer unit.  However, as the API is not fixed, you use these at
your own risk.

<p>LIMITATIONS

<p><code>peek-char</code> currently does not have Unicode semantics (i.e. it peeks
only a single byte) to avoid problems with port buffering.

<p>char-sets are not interchangeable between the existing srfi-14 code
and Unicode code (i.e. do not pass a Unicode char-set to an external
library that directly uses the old srfi-14).


<p>PERFORMANCE

<p><code>string-length</code>, <code>string-ref</code> and <code>string-set!</code> are all O(n) operations as
opposed to the usual O(1), since UTF-8 is a variable-width encoding.
Their use is discouraged; it is much cleaner to use the
high-level SRFI-13 procedures and string ports.  For examples of how
to express common idioms without these procedures, look at any string-based
code in Gauche.
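<p>Why indexing becomes O(n) can be seen from a minimal sketch, written in
Python over raw UTF-8 bytes.  The function <code>utf8_ref</code> is a hypothetical helper
for illustration and says nothing about how this egg implements
<code>string-ref</code>; it only shows that finding the n-th character requires scanning
from the start, because characters have different byte widths.

```python
def utf8_ref(buf, index):
    """Return the character at codepoint position `index` in UTF-8 bytes `buf`.

    The bytes are walked from the start, so the cost grows with the length
    of the string, unlike a lookup in a fixed-width encoding.
    """
    seen = -1
    for offset, byte in enumerate(buf):
        if byte >> 6 == 0b10:
            continue                      # continuation byte, same character
        seen += 1                         # any other byte starts a new character
        if seen == index:
            end = offset + 1
            while end != len(buf) and buf[end] >> 6 == 0b10:
                end += 1                  # extend over the continuation bytes
            return buf[offset:end].decode("utf-8")
    raise IndexError("character index out of range")


sample = "a\u00e9\u65e5z".encode("utf-8")   # 4 characters stored in 7 bytes
last = utf8_ref(sample, 3)                  # the scan passes all earlier bytes
```

<p>Sequential processing, as done by string ports or the SRFI-13 procedures,
avoids restarting this scan for every character.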

<p>Furthermore, <code>string-set!</code> and other procedures that modify strings in
place may trigger garbage collection (gc) if the mutated result does not
fit within the same UTF-8 encoding size as the original string.  If you
only mutate 7-bit ASCII strings (or only replace characters with others of
the same encoded size, such as Cyrillic->Cyrillic), no gc will occur.
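<p>The encoded sizes in question can be checked with a short Python sketch.
The helpers <code>utf8_width</code> and <code>same_width</code> are hypothetical names introduced for
this example; they only compare byte widths and do not model the egg's
allocation behaviour.

```python
def utf8_width(ch):
    """Number of bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def same_width(old, new):
    """True if replacing `old` with `new` keeps the encoded length unchanged."""
    return utf8_width(old) == utf8_width(new)


# ASCII, Cyrillic, CJK, and a character beyond the Basic Multilingual Plane
widths = [utf8_width(c) for c in ("a", "\u0434", "\u65e5", "\U0001F600")]

ascii_to_ascii = same_width("a", "b")                  # 1 byte stays 1 byte
cyrillic_to_cyrillic = same_width("\u0434", "\u0436")  # 2 bytes stay 2 bytes
ascii_to_accented = same_width("e", "\u00e9")          # 1 byte would become 2
```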

<p><code>string?</code>, <code>string=?</code>, <code>string-append</code>, all R5RS string comparisons, and
<code>read-line</code> are unmodified.

<p>Regular expression matching will be just as fast except in the case
of Unicode character classes (which were not possible before anyway).

<p>All other procedures incur zero to minor overhead, but keep the same
asymptotic performance.


<p>DISCUSSION

<p>There are two ways to add Unicode string support to an existing
language: redefine the strings themselves (i.e. add a new string
type), or redefine the operations on the strings.  The former causes
a schism in your string libraries, dividing them between
Unicode-aware and not, either doubling your library implementations
or limiting them to one type or the other.  You can't freely pass
strings to other libraries without keeping track of their types and
converting when needed.  It becomes slow and unwieldy.  C and Perl
are the only languages I know of that seriously tried this.  In Perl,
few modules worked with Unicode strings, frequent
type conversions were needed, a general mess ensued, and Perl very
quickly switched to the latter approach.  In C as well, the libraries
supporting wchar are still few, while most libraries support only
char.

<p>UTF-8 is ideal for the in-place sort of extension because it is
backwards compatible with ASCII.  Any ASCII (7-bit) byte found within
a UTF-8 string is guaranteed to be that character, not part of a
multibyte character, so parsing libraries that work on ASCII
characters work unmodified.  This includes most existing text formats
and network protocols.  The EUC (Extended Unix Code) encodings also
have this feature, so a similar module could be implemented allowing
users to (require 'euc-jp), for example, and work in Japanese EUC
rather than Unicode.  Other encodings such as Shift_JIS satisfy the
requirement that an ASCII string has the same meaning in the
encoding, but multibyte characters in the encoding may include ASCII
bytes, breaking the rule we need for safe ASCII parsing.  A few
encodings like UTF-16 and UTF-32 are completely incompatible.  UTF-16
is primarily used these days by Java, a victim of the
unfortunate fact that UTF-16 was at first fixed-width but, with the
advent of surrogate pairs, no longer is.  Note that even without this
module you can write source code in Chicken in any ASCII-compatible
encoding like ISO-8859-* or UTF-8 and define symbols with that
encoding (letting you replace lambda with syntax for a real Greek
lambda, for example).
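<p>The ASCII-safety rule described above can be checked directly.  The sketch
below is in Python and uses its standard codecs; <code>multibyte_is_ascii_safe</code> is a
hypothetical helper for this example.  It encodes one katakana character,
chosen because its Shift_JIS form is known to contain the byte 0x5C (the
ASCII backslash), in three encodings.

```python
def multibyte_is_ascii_safe(ch, encoding):
    """True if no byte of the encoded character falls in the 7-bit ASCII range."""
    return all(byte >= 0x80 for byte in ch.encode(encoding))


katakana_so = "\u30bd"

utf8_safe = multibyte_is_ascii_safe(katakana_so, "utf-8")       # e3 82 bd
eucjp_safe = multibyte_is_ascii_safe(katakana_so, "euc_jp")     # a5 bd
sjis_safe = multibyte_is_ascii_safe(katakana_so, "shift_jis")   # 83 5c

# A parser scanning bytes for a backslash would find one inside the
# Shift_JIS character, but never inside the UTF-8 or EUC-JP forms.
sjis_bytes = katakana_so.encode("shift_jis")
```

<p>Plain ASCII text encodes to the same bytes in all three, so the difference
only shows up once multibyte characters are involved.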

<p>Other languages that use UTF-8 include Perl, Python, and Tcl.  XML and
a growing number of network standards use UTF-8 by
default, and major databases all support UTF-8.  Libraries with UTF-8
support include Gtk, SDL, and FreeType.


<h3>License:</h3>

<pre>
Copyright (c) 2004-2005, Alex Shinn
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met:

  Redistributions of source code must retain the above copyright notice, this list of conditions and the following
    disclaimer.
  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided with the distribution.
  Neither the name of the author nor the names of its contributors may be used to endorse or promote
    products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
</pre>


<hr><a href="index.html">back</a>


</body></html>