Opened 9 years ago
Closed 9 years ago
#1182 closed enhancement (fixed)
utf8 egg silently accepts invalid byte sequences
Reported by: | Moritz Heidkamp | Owned by: | Alex Shinn |
---|---|---|---|
Priority: | major | Milestone: | someday |
Component: | extensions | Version: | 4.9.x |
Keywords: | utf8 | Cc: | |
Estimated difficulty: |
Description
I noticed that some procedures of the utf8
egg silently accept invalid byte sequences. This might have some safety implications, e.g. consider this case (the procedures used are the core versions, procedures from the utf8
egg are prefixed with utf8-
in the following code snippets):
(define evil-quote (list->string (map integer->char '(#b11000000 #b10100111))))
This is an invalid (overlong) UTF-8 encoding of the '
character. Now a program could perform a check like this to make sure a user supplied string doesn't contain any quotes:
(unless (utf8-string-contains evil-quote "'") ...)
And then go ahead and write it character by character like this:
(utf8-string-for-each display evil-quote)
Which would produce the actual '
character. The same is true for any other procedure that produces characters from strings, e.g. string-ref
, string->list
, etc.
Any other invalid byte sequence (such as stray continuation bytes) is also silently accepted.
I'm not entirely sure what would be the wisest way to handle this. We could have these procedures signal an error or just mention this behavior in the documentation so that people know to perform validation on untrusted inputs.
Attachments (2)
Change History (15)
comment:1 Changed 9 years ago by
comment:2 Changed 9 years ago by
This is intentional - existing chicken code mixes binary strings
and text strings as strings, so we can't in general forbid such
invalid sequences.
We can try to provide sane defaults, and indeed if you use that
definition of evil-quote with utf8 imported, you get a valid
sequence. We absolutely can't do anything about users who
aren't even using the utf8 egg.
What we _can_ (and should) do is provide utilities to check if
a string is valid utf8, and/or strip invalid sequences.
comment:3 Changed 9 years ago by
Hey Alex,
thanks for your reply!
This is intentional - existing chicken code mixes binary strings
and text strings as strings, so we can't in general forbid such
invalid sequences.
The utf8 egg's procedures surely could detect them, the question is whether that is the wisest way to go about it. But see below.
We can try to provide sane defaults, and indeed if you use that
definition of evil-quote with utf8 imported, you get a valid
sequence.
No, it's an invalid sequence as per the UTF-8 spec both in the Unicode standard and the RFC. See the Wikipedia article -- it is certainly possible to interpret some of them but it's still outside of the spec, thus potentially leading to exploits.
We absolutely can't do anything about users who
aren't even using the utf8 egg.
Sure, I'm only talking about the utf8 egg here -- the core string procedures are defined to operate on the byte level so that's what users get.
What we _can_ (and should) do is provide utilities to check if
a string is valid utf8, and/or strip invalid sequences.
Yep, I think that'd be my preferred solution, too. I've implemented UTF-8 validation the other day which I'd be willing to contribute to the utf8 egg if you like. I have both a Scheme and a C implementation, the latter of which is an order of magnitude faster than the former. Would you care for a patch?
comment:4 follow-up: 5 Changed 9 years ago by
Maybe I misunderstand, but your code does not generate
an invalid byte sequence for me:
$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))"
ˤ
$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))" | hexdump -C
00000000 c3 80 c2 a7 0a |.....|
00000005
These are the characters 00C0;LATIN CAPITAL LETTER A WITH GRAVE
and 00A7;SECTION SIGN corresponding to #b11000000 and #b10100111.
If you find what you think is a bug, please write a full program and attach it,
using "test" to show clearly what you expect and what is different.
comment:5 Changed 9 years ago by
Replying to ashinn:
Maybe I misunderstand, but your code does not generate
an invalid byte sequence for me:
$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))"
ˤ
$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))" | hexdump -C
00000000 c3 80 c2 a7 0a |.....|
00000005
That bit is just about constructing the (invalid) byte sequence that is to be fed to the UTF-8 decoder. Note that I mentioned that list->string
here is the core procedure, not the one from the utf8
egg.
These are the characters 00C0;LATIN CAPITAL LETTER A WITH GRAVE
and 00A7;SECTION SIGN corresponding to #b11000000 and #b10100111.
Right, those are the Unicode code points represented by these two numbers, which the utf8
egg's list->string
procedure properly encodes as a 4 byte UTF-8 sequence. However, as mentioned above, the issue is about a byte sequence c0 a7
(in the form of a CHICKEN string) which is passed to one of the utf8
egg's decoding procedures.
If you find what you think is a bug, please write a full program and attach it,
using "test" to show clearly what you expect and what is different.
Here you go! Since there is no correct value to expect (because there is no way to UTF-8 decode this byte sequence) I am using an inverted test-assert
:
(use test (prefix utf8 utf8-)) (test-assert (not (string=? "'" (utf8-list->string (utf8-string->list (list->string (map integer->char '(#b11000000 #b10100111))))))))
comment:6 follow-up: 8 Changed 9 years ago by
Resolution: | → invalid |
---|---|
Status: | new → closed |
That's not a complete test, and you're using different code now.
You're now implicitly referring to utf8-lolevel procedures. The
meaning of "lolevel" is that it is dangerous, and allows you to
shoot yourself in the foot.
(use utf8) puts the standard procedures in utf8 mode. If you
pass valid inputs to those procedures and get an invalid output
it's a bug, and I will fix it. If you pass invalid inputs, you get
undefined results. Both of your examples are of invalid inputs,
created outside of utf8.
comment:7 Changed 9 years ago by
Sorry, you were using non-utf8 procedures by renaming,
not by the utf8-lolevel egg, but the result is the same.
You're passing invalid inputs.
comment:8 Changed 9 years ago by
Replying to ashinn:
That's not a complete test,
What's missing?
and you're using different code now.
I was using string-for-each
in my inital example to illustrate the general issue but that doesn't lend itself too well for a test so I switched to string->list
instead. As both procedures rely on the same UTF-8 decoder internally, the code is essentially equivalent AFAICT.
(use utf8) puts the standard procedures in utf8 mode. If you
pass valid inputs to those procedures and get an invalid output
it's a bug, and I will fix it. If you pass invalid inputs, you get
undefined results. Both of your examples are of invalid inputs,
created outside of utf8.
Yep, that's exactly the point: passing strings that were created without any of the utf8
string constructors. Please also read my second last reply again: I agree with you about preserving the current behavior of the decoder procedures. Instead, we should provide validation procedures for users who need to deal with strings they received from untrusted sources (e.g. from third party libraries which don't use the utf8
procedures).
I think the issue boils down to the fact that the utf8
egg overloads / re-uses the core string type but currently doesn't provide a predicate to check whether a string is actually valid for use with its API.
I hope that clarifies my point :-) So again: Would you be interested in integrating such a validation predicate with the utf8
egg? I think it would belong there but I can also make it a separate egg if you prefer.
comment:9 follow-up: 10 Changed 9 years ago by
Yes, a validation predicate is a long-standing todo.
I met get around to it soon, patches are also welcome.
It would in theory be possible to validate every input
to every utf8 operation, but I have no intention of doing
so, for performance reasons and because people may
currently be using invalid utf8 in "safe" ways already.
comment:10 Changed 9 years ago by
Resolution: | invalid |
---|---|
Status: | closed → reopened |
Replying to ashinn:
Yes, a validation predicate is a long-standing todo.
I met get around to it soon, patches are also welcome.
I'm attaching a patch which adds utf8-validation
module along with some rudimentary sanity tests. It only exports the discussed predicate, named utf8-string?
-- that's the reason I put it in a separate module, since I couldn't think of a better name and I didn't want to make the main utf8
module un-prefixable. Perhaps you have a better idea?
The validation algorithm is based on The Unicode Standard, Version 7.0 - Core Specification, Table 3-7, p. 125. It performs reasonably well when compiled with -O2 -specialize
or -O3
(around an order of magnitude slower than an implementation of the same algorithm in C). I provide it to you under the same license as the utf8
egg so feel free to include it.
It would in theory be possible to validate every input
to every utf8 operation, but I have no intention of doing
so, for performance reasons and because people may
currently be using invalid utf8 in "safe" ways already.
Yep, I totally agree with that!
Changed 9 years ago by
Attachment: | utf8-validation.diff added |
---|
comment:11 Changed 9 years ago by
Type: | defect → enhancement |
---|
The code looks good. How about keeping it in utf8 and just calling it valid-string?
comment:12 Changed 9 years ago by
Ah yes, very good idea! I'm attaching a new patch which puts the implementation in utf8-lolevel
and reexports it as valid-string?
from utf8
. I've also changed the implementation to re-use the existing utf8-start-byte->length
. What do you think?
Changed 9 years ago by
Attachment: | utf8-validation-2.diff added |
---|
comment:13 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Looks good. I simplified a little and added documentation.
Should be fixed in version 3.4.0.
The section on invalid byte sequences of the Wikipedia article on UTF-8 is quite a nice overview of the general issue.