source: project/utf8/utf8.html @ 7833

Last change on this file since 7833 was 7833, checked in by Kon Lovett, 12 years ago

Rmvd read/write-byte def since now in extras. Added note about variable redefines during utf8-srfi-13 compile.

File size: 12.7 KB
<html><head><title>utf8</title>

<style type="text/css">
  <!--
      CODE {
             color: #666666;
           }
      EM {
           font-weight: bold;
           font-style: normal;
         }
      DT.function {
                    background: #f5f5f5;
                    color: black;
                    padding: 0.1em;
                    border: 1px solid #bbbaaf;
                    font-family: monospace;
                  }
      PRE {
        background: #efeee0;
        padding: 0.1em;
        border: 1px solid #bbbaaf;
      }
    -->
</style></head>
  <body>

<center><img src="egg.jpg"></center>
<center><a href="index.html">back</a></center>

<h2>utf8</h2>

<h3>Description:</h3>

Unicode support

<h3>Author:</h3>
Alex Shinn

<h3>Version:</h3>
<ul>
<li>1.13 Removed read/write-byte definition. [Kon Lovett]</li>
<li>1.12 </li>
<li>1.11 Fixed misspelled variables, '\' in string, added imports/exports [Kon Lovett]</li>
<li>1.1
Fixed platform-issue with case-map [Thanks to Kon Lovett and Alex Shinn]; separated
SRFI-13 and SRFI-14 parts into separate extensions</li>
<li>1.0</li>
</ul>

<h3>Usage:</h3>
<pre>(require-extension utf8)
</pre>

<h3>Requires:</h3>
<a href="iset.html"><code>iset</code></a> and <a href="syntax-case.html">syntax-case</a>

<h3>Download:</h3>
<a href="utf8.egg">utf8.egg</a>

<h3>Documentation:</h3>

<p>USAGE

<p>To make your code Unicode-aware, do the following:

<pre>
  (require-extension utf8)
  (module mymodule ()
    (import utf8)

    ... ; unicode-aware code

    )
</pre>

<p>Then all core, extra and regex string operations will be
Unicode-aware: <code>string-length</code> will return the number of codepoints,
not the number of bytes; <code>string-ref</code> will index by codepoints and
return a char with an integer value up to 2^21; regular expressions
will match single codepoints rather than bytes and understand Unicode
character classes; and so on.
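<p>To make the difference between byte length and codepoint length concrete,
here is a minimal sketch in portable Scheme.  It is not part of this egg: it
works on a list of byte values rather than on strings, and the names
<code>continuation-byte?</code> and <code>utf8-codepoint-count</code> are made up
for the example.

<pre>
  ;; Sketch only: counts codepoints in a list of UTF-8 byte values.
  ;; Bytes 128..191 (binary 10xxxxxx) are continuation bytes, so a
  ;; codepoint starts at every byte outside that range.
  (define (continuation-byte? b)
    (= (quotient b 64) 2))

  (define (utf8-codepoint-count bytes)
    (let loop ((bs bytes) (n 0))
      (cond ((null? bs) n)
            ((continuation-byte? (car bs)) (loop (cdr bs) n))
            (else (loop (cdr bs) (+ n 1))))))

  ;; "naïve" encoded as UTF-8; the i-diaeresis takes two bytes (195 175).
  (define naive-bytes '(110 97 195 175 118 101))

  (display (length naive-bytes))                ; byte count: 6
  (newline)
  (display (utf8-codepoint-count naive-bytes))  ; codepoint count: 5
  (newline)
</pre>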

<p>Strings are still native strings and may be passed to external
libraries (either Scheme or foreign) perfectly safely.  Libraries
that do parsing invariably do so on ASCII character boundaries and
are thus guaranteed to be compatible.  Libraries that reference
strings by index would need to be replaced with a UTF-8-aware version.
To my knowledge, all existing eggs are currently UTF-8 safe.

<p>Alternatively, you may import utf8 at the top-level:

<pre>
  ; require modules using byte-semantics
  ...
  (require-extension utf8)
  (import utf8)
  ...
  ; require modules using utf8-semantics
  ...
  ; unicode-aware code
</pre>

<p>When you import directly into the top-level, any subsequently loaded
code will also use Unicode-aware semantics, even if it was not
written with Unicode in mind.  This is more powerful but slightly
less safe, since third party units may make assumptions about
character ranges or string size.

<p>To use the Unicode-aware SRFI-13 and SRFI-14 procedures with UTF-8 semantics:

<pre>
  (require-extension utf8-srfi-13)
  (module ()
    (import utf8-srfi-13)

    ... ; unicode-aware SRFI-13
    )

  (require-extension utf8-srfi-14)
  (module ()
    (import utf8-srfi-14)

    ... ; unicode-capable SRFI-14
    )
</pre>

<p>The SRFI-14 module provides an alternative to the standard Chicken SRFI-14.
As a pure superset which handles arbitrary-sized characters it should
be usable as a drop-in replacement.  The only aspect related to UTF-8
is that <code>string->char-set</code> assumes the string is UTF-8 encoded.

<p>UNICODE CHAR-SETS

<p>The default SRFI-14 char-sets are defined using ASCII-only
characters, since this is both useful and lighter-weight.  To obtain
full Unicode char-set definitions, use the char-set unit:

<pre>
  (require-extension char-set)
</pre>

<p>The following char-sets are provided based on the Unicode properties:

<pre>
  char-set:alphabetic
  char-set:arabic
  char-set:armenian
  char-set:ascii-hex-digit
  char-set:bengali
  char-set:bidi-control
  char-set:bopomofo
  char-set:braille
  char-set:buhid
  char-set:canadian-aboriginal
  char-set:cherokee
  char-set:common
  char-set:cypriot
  char-set:cyrillic
  char-set:dash
  char-set:default-ignorable-code-point
  char-set:deprecated
  char-set:deseret
  char-set:devanagari
  char-set:diacritic
  char-set:ethiopic
  char-set:extender
  char-set:georgian
  char-set:gothic
  char-set:grapheme-base
  char-set:grapheme-extend
  char-set:grapheme-link
  char-set:greek
  char-set:gujarati
  char-set:gurmukhi
  char-set:han
  char-set:hangul
  char-set:hanunoo
  char-set:hebrew
  char-set:hex-digit
  char-set:hiragana
  char-set:hyphen
  char-set:id-continue
  char-set:id-start
  char-set:ideographic
  char-set:ids-binary-operator
  char-set:ids-trinary-operator
  char-set:inherited
  char-set:join-control
  char-set:kannada
  char-set:katakana
  char-set:katakana-or-hiragana
  char-set:khmer
  char-set:lao
  char-set:latin
  char-set:limbu
  char-set:linear-b
  char-set:logical-order-exception
  char-set:lowercase
  char-set:malayalam
  char-set:math
  char-set:mongolian
  char-set:myanmar
  char-set:noncharacter-code-point
  char-set:ogham
  char-set:old-italic
  char-set:oriya
  char-set:osmanya
  char-set:quotation-mark
  char-set:radical
  char-set:runic
  char-set:shavian
  char-set:sinhala
  char-set:soft-dotted
  char-set:sterm
  char-set:syriac
  char-set:tagalog
  char-set:tagbanwa
  char-set:tai-le
  char-set:tamil
  char-set:telugu
  char-set:terminal-punctuation
  char-set:thaana
  char-set:thai
  char-set:tibetan
  char-set:ugaritic
  char-set:unified-ideograph
  char-set:uppercase
  char-set:variation-selector
  char-set:white-space
  char-set:xid-continue
  char-set:xid-start
  char-set:yi
</pre>

<p>UNICODE CASE-MAPPINGS

<p>The SRFI-13 case-mapping procedures (<code>string-upcase</code>, etc.) are defined
using only ASCII case-mappings, since this is both useful and
lighter-weight.  To get full Unicode-aware case-mappings, use the
case-map unit:

<pre>
  (require-extension case-map)
</pre>

<p>It provides the <code>upcase</code>, <code>downcase</code>, and <code>titlecase</code>
procedures.  These take a first argument of either a string or port, and an
optional second argument of locale (as a string), and return the appropriate
locale-aware case-mapped string.


<p>BYTE-STRINGS

<p>Sometimes you may need access to the original string primitives so
you can directly access bytes, such as if you were implementing your
own regex library or text buffer and wanted optimal performance.  For
these cases we have renamed the original primitives by replacing
<code>string</code> with <code>byte-string</code>.  Thus <code>byte-string-length</code> is the length
of the string in bytes, not characters (the equivalent of Gauche's
<code>string-size</code>).  <code>byte-string-set!</code> can corrupt the UTF-8 encoding and
should be used sparingly if at all.
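<p>As a hedged usage sketch, the two lengths can be compared side by side.
This assumes the source file is saved as UTF-8 and that top-level bindings
such as <code>display</code> are visible inside the module; the module name
<code>byte-demo</code> is made up for the example.

<pre>
  (require-extension utf8)
  (module byte-demo ()
    (import utf8)

    (define s "naïve")                 ; 5 codepoints, 6 bytes in UTF-8

    (display (string-length s))        ; length in codepoints
    (newline)
    (display (byte-string-length s))   ; length in bytes
    (newline))
</pre>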


<p>LOW LEVEL API

<p>Direct manipulation of the utf8 encoding is factored away in the
utf8-lolevel unit.  This includes an abstract string-pointer API, and
an analogous string-pointer implementation for ASCII strings in the
string-pointer unit.  However, the API is not fixed, so use these at
your own risk.

<p>LIMITATIONS

<p><code>peek-char</code> currently does not have Unicode semantics (i.e. it peeks
only a single byte) to avoid problems with port buffering.

<p>char-sets are not interchangeable between the existing srfi-14 code
and Unicode code (i.e. do not pass a Unicode char-set to an external
library that directly uses the old srfi-14).


<p>PERFORMANCE

<p><code>string-length</code>, <code>string-ref</code> and <code>string-set!</code> are all O(n) operations,
as opposed to the usual O(1), since UTF-8 is a variable-width encoding.
Their use is discouraged: it is much cleaner to use the
high-level SRFI-13 procedures and string ports.  For examples of how
to do common idioms without these procedures look at any string-based
code in Gauche.
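<p>The linear cost comes from having to walk the bytes to find where a given
codepoint starts.  The following portable sketch shows the idea on a list of
byte values; it is an illustration only, not the egg's implementation, and
<code>utf8-index->offset</code> is a hypothetical name.

<pre>
  ;; Bytes 128..191 (binary 10xxxxxx) are UTF-8 continuation bytes.
  (define (continuation-byte? b)
    (= (quotient b 64) 2))

  ;; Returns the byte offset at which codepoint number k (counting from 0)
  ;; starts, or #f if the byte list holds k or fewer codepoints.
  ;; Every call scans from the first byte, hence O(n).
  (define (utf8-index->offset bytes k)
    (let loop ((bs bytes) (offset 0) (seen 0))
      (cond ((null? bs) #f)
            ((continuation-byte? (car bs))
             (loop (cdr bs) (+ offset 1) seen))
            ((= seen k) offset)
            (else (loop (cdr bs) (+ offset 1) (+ seen 1))))))

  ;; "naïve" as UTF-8 bytes: codepoint 3 (the "v") starts at byte 4,
  ;; because the i-diaeresis before it occupies bytes 2 and 3.
  (display (utf8-index->offset '(110 97 195 175 118 101) 3))
  (newline)
</pre>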

<p>Furthermore, <code>string-set!</code> and other procedures that modify strings in
place may invoke the garbage collector (gc) if the mutated result does not
fit within the same UTF-8 encoding size as the original string.  If only mutating
7-bit ASCII strings (or only mutating within fixed encoding sizes
such as Cyrillic->Cyrillic) then no gc will occur.
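<p>Whether a replacement keeps the encoding size can be checked from the
codepoint values alone.  This is a rough sketch of the standard UTF-8 size
rule; the procedure names are invented for the example and are not provided
by this egg.

<pre>
  ;; Number of bytes UTF-8 needs for a codepoint given as an integer.
  (define (utf8-encoded-size cp)
    (cond ((>= cp 65536) 4)
          ((>= cp 2048) 3)
          ((>= cp 128) 2)
          (else 1)))

  ;; A replacement leaves the byte length unchanged only when the old and
  ;; new codepoints encode to the same number of bytes.
  (define (same-size-replacement? old-cp new-cp)
    (= (utf8-encoded-size old-cp) (utf8-encoded-size new-cp)))

  (display (same-size-replacement? 1076 1078))  ; Cyrillic U+0434 -> U+0436: #t
  (newline)
  (display (same-size-replacement? 100 1076))   ; ASCII "d" -> Cyrillic U+0434: #f
  (newline)
</pre>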

<p><code>string?</code>, <code>string=?</code>, <code>string-append</code>, all R5RS string comparisons, and
<code>read-line</code> are unmodified.

<p>Regular expression matching will be just as fast except in the case
of Unicode character classes (which were not possible before anyway).

<p>All other procedures incur zero to minor overhead, but keep the same
asymptotic performance.


<p>DISCUSSION

<p>There are two ways to add Unicode string support to an existing
language: redefine the strings themselves (i.e. add a new string
type), or redefine the operations on the strings.  The former causes
a schism in your string libraries, dividing them between
Unicode-aware and not, either doubling your library implementations
or limiting them to one type or the other.  You can't freely pass
strings to other libraries without keeping track of their types and
converting when needed.  It becomes slow and unwieldy.  C and Perl
are the only languages I know of that seriously tried this.  In Perl
the modules which worked with Unicode strings were minimal, frequent
type conversions were needed, a general mess ensued, and Perl very
quickly switched to the latter approach.  In C as well, the libraries
supporting wchar are still minimal, while most libraries still only
support char.

<p>UTF-8 is ideal for the in-place sort of extension because it is
backwards compatible with ASCII.  Any ASCII (7-bit) byte found within
a UTF-8 string is guaranteed to be that character, not part of a
multibyte character, so parsing libraries that work on ASCII
characters work unmodified.  This includes most existing text formats
and network protocols.  The EUC (Extended Unix Code) encodings also
have this feature, so a similar module could be implemented allowing
users to (require 'euc-jp), for example, and work in Japanese EUC
rather than Unicode.  Other encodings such as Shift_JIS satisfy the
requirement that an ASCII string has the same meaning in the
encoding, but multibyte characters in the encoding may include ASCII
bytes, breaking the rule we need for safe ASCII parsing.  A few
encodings like UTF-16 and UTF-32 are completely incompatible.  UTF-16
is primarily used these days by Java, a victim of the
unfortunate fact that UTF-16 was at first fixed width but is no longer,
with the advent of surrogate pairs.  Note that even without this
module you can write source code in Chicken in any ASCII-compatible
encoding like ISO-8859-* or UTF-8 and define symbols with that
encoding (letting you replace lambda with syntax for a real Greek
lambda, for example).
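<p>The Shift_JIS problem mentioned above can be seen with a single character.
The sketch below is illustrative only and uses a made-up helper,
<code>contains-byte?</code>, on lists of byte values: the katakana letter
U+30BD contains the ASCII backslash byte (92) in Shift_JIS but not in UTF-8,
so a byte-oriented parser looking for a backslash is misled only by the former.

<pre>
  ;; True if the byte value target occurs anywhere in the byte list.
  (define (contains-byte? bytes target)
    (cond ((null? bytes) #f)
          ((= (car bytes) target) #t)
          (else (contains-byte? (cdr bytes) target))))

  (define backslash 92)                 ; ASCII code of the backslash
  (define so-utf8 '(227 130 189))       ; U+30BD encoded as UTF-8
  (define so-shift-jis '(131 92))       ; U+30BD encoded as Shift_JIS

  (display (contains-byte? so-utf8 backslash))       ; #f
  (newline)
  (display (contains-byte? so-shift-jis backslash))  ; #t
  (newline)
</pre>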

<p>Other languages that use UTF-8 include Perl, Python, and Tcl.  XML and
more and more network standards are using UTF-8 by
default, and major databases all support UTF-8.  Libraries with UTF-8
support include Gtk, SDL, and freetype.


<p>COMPILATION NOTE

<p>Do not be alarmed by the flurry of warning messages about
"redefinition of imported variable `utf8-srfi-13$$...' from `utf8"
during compilation of 'utf8-srfi-13.scm'. The messages are the
result of the utf8.setup-info exports listing. The basic extension
is named 'utf8' and is used by 'utf8-srfi-13'. The list of all exports
for the entire package is also named 'utf8'. So the variables appear
to be defined by the 'utf8' extension.</p>


<h3>License:</h3>

<pre>
Copyright (c) 2004-2005, Alex Shinn
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following
conditions are met:

  Redistributions of source code must retain the above copyright notice, this list of conditions and the following
    disclaimer.
  Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials provided with the distribution.
  Neither the name of the author nor the names of its contributors may be used to endorse or promote
    products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.
</pre>


<hr><a href="index.html">back</a>


</body></html>