Context Navigation

← Previous Ticket
Next Ticket →

#1182 closed enhancement (fixed)

utf8 egg silently accepts invalid byte sequences

Reported by:	Moritz Heidkamp	Owned by:	Alex Shinn
Priority:	major	Milestone:	someday
Component:	extensions	Version:	4.9.x
Keywords:	utf8	Cc:
Estimated difficulty:

Description

I noticed that some procedures of the utf8 egg silently accept invalid byte sequences. This might have some safety implications, e.g. consider this case (the procedures used are the core versions, procedures from the utf8 egg are prefixed with utf8- in the following code snippets):

(define evil-quote
  (list->string (map integer->char '(#b11000000 #b10100111))))

This is an invalid (overlong) UTF-8 encoding of the ' character. Now a program could perform a check like this to make sure a user supplied string doesn't contain any quotes:

(unless (utf8-string-contains evil-quote "'") ...)

And then go ahead and write it character by character like this:

(utf8-string-for-each display evil-quote)

Which would produce the actual ' character. The same is true for any other procedure that produces characters from strings, e.g. string-ref, string->list, etc.

Any other invalid byte sequence (such as stray continuation bytes) is also silently accepted.

I'm not entirely sure what would be the wisest way to handle this. We could have these procedures signal an error or just mention this behavior in the documentation so that people know to perform validation on untrusted inputs.

Attachments (2)

utf8-validation.diff (4.4 KB) - added by Moritz Heidkamp 10 years ago.
utf8-validation-2.diff (4.2 KB) - added by Moritz Heidkamp 10 years ago.

Download all attachments as: .zip

Change History (15)

comment:1 Changed 10 years ago by Moritz Heidkamp

The section on invalid byte sequences of the Wikipedia article on UTF-8 is quite a nice overview of the general issue.

comment:2 Changed 10 years ago by Alex Shinn

This is intentional - existing chicken code mixes binary strings
and text strings as strings, so we can't in general forbid such
invalid sequences.

We can try to provide sane defaults, and indeed if you use that
definition of evil-quote with utf8 imported, you get a valid
sequence. We absolutely can't do anything about users who
aren't even using the utf8 egg.

What we _can_ (and should) do is provide utilities to check if
a string is valid utf8, and/or strip invalid sequences.

comment:3 Changed 10 years ago by Moritz Heidkamp

Hey Alex,

thanks for your reply!

This is intentional - existing chicken code mixes binary strings
and text strings as strings, so we can't in general forbid such
invalid sequences.

The utf8 egg's procedures surely could detect them, the question is whether that is the wisest way to go about it. But see below.

We can try to provide sane defaults, and indeed if you use that
definition of evil-quote with utf8 imported, you get a valid
sequence.

No, it's an invalid sequence as per the UTF-8 spec both in the Unicode standard and the RFC. See the Wikipedia article -- it is certainly possible to interpret some of them but it's still outside of the spec, thus potentially leading to exploits.

We absolutely can't do anything about users who
aren't even using the utf8 egg.

Sure, I'm only talking about the utf8 egg here -- the core string procedures are defined to operate on the byte level so that's what users get.

What we _can_ (and should) do is provide utilities to check if
a string is valid utf8, and/or strip invalid sequences.

Yep, I think that'd be my preferred solution, too. I've implemented UTF-8 validation the other day which I'd be willing to contribute to the utf8 egg if you like. I have both a Scheme and a C implementation, the latter of which is an order of magnitude faster than the former. Would you care for a patch?

comment:4 follow-up: 5 Changed 10 years ago by Alex Shinn

Maybe I misunderstand, but your code does not generate
an invalid byte sequence for me:

$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))"
À§
$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))" | hexdump -C
00000000 c3 80 c2 a7 0a |.....|
00000005

These are the characters 00C0;LATIN CAPITAL LETTER A WITH GRAVE
and 00A7;SECTION SIGN corresponding to #b11000000 and #b10100111.

If you find what you think is a bug, please write a full program and attach it,
using "test" to show clearly what you expect and what is different.

comment:5 in reply to: 4 Changed 10 years ago by Moritz Heidkamp

Replying to ashinn:

Maybe I misunderstand, but your code does not generate
an invalid byte sequence for me:

$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))"
À§
$ csi -R utf8 -p "(list->string (map integer->char '(#b11000000 #b10100111)))" | hexdump -C
00000000 c3 80 c2 a7 0a |.....|
00000005

That bit is just about constructing the (invalid) byte sequence that is to be fed to the UTF-8 decoder. Note that I mentioned that list->string here is the core procedure, not the one from the utf8 egg.

These are the characters 00C0;LATIN CAPITAL LETTER A WITH GRAVE
and 00A7;SECTION SIGN corresponding to #b11000000 and #b10100111.

Right, those are the Unicode code points represented by these two numbers, which the utf8 egg's list->string procedure properly encodes as a 4 byte UTF-8 sequence. However, as mentioned above, the issue is about a byte sequence c0 a7 (in the form of a CHICKEN string) which is passed to one of the utf8 egg's decoding procedures.

If you find what you think is a bug, please write a full program and attach it,
using "test" to show clearly what you expect and what is different.

Here you go! Since there is no correct value to expect (because there is no way to UTF-8 decode this byte sequence) I am using an inverted test-assert:

(use test (prefix utf8 utf8-))
(test-assert (not (string=? "'" (utf8-list->string (utf8-string->list (list->string (map integer->char '(#b11000000 #b10100111))))))))

comment:6 follow-up: 8 Changed 10 years ago by Alex Shinn

Resolution:	→ invalid
Status:	new → closed

That's not a complete test, and you're using different code now.
You're now implicitly referring to utf8-lolevel procedures. The
meaning of "lolevel" is that it is dangerous, and allows you to
shoot yourself in the foot.

(use utf8) puts the standard procedures in utf8 mode. If you
pass valid inputs to those procedures and get an invalid output
it's a bug, and I will fix it. If you pass invalid inputs, you get
undefined results. Both of your examples are of invalid inputs,
created outside of utf8.

comment:7 Changed 10 years ago by Alex Shinn

Sorry, you were using non-utf8 procedures by renaming,
not by the utf8-lolevel egg, but the result is the same.
You're passing invalid inputs.

comment:8 in reply to: 6 Changed 10 years ago by Moritz Heidkamp

Replying to ashinn:

That's not a complete test,

What's missing?

and you're using different code now.

I was using string-for-each in my inital example to illustrate the general issue but that doesn't lend itself too well for a test so I switched to string->list instead. As both procedures rely on the same UTF-8 decoder internally, the code is essentially equivalent AFAICT.

(use utf8) puts the standard procedures in utf8 mode. If you
pass valid inputs to those procedures and get an invalid output
it's a bug, and I will fix it. If you pass invalid inputs, you get
undefined results. Both of your examples are of invalid inputs,
created outside of utf8.

Yep, that's exactly the point: passing strings that were created without any of the utf8 string constructors. Please also read my second last reply again: I agree with you about preserving the current behavior of the decoder procedures. Instead, we should provide validation procedures for users who need to deal with strings they received from untrusted sources (e.g. from third party libraries which don't use the utf8 procedures).

I think the issue boils down to the fact that the utf8 egg overloads / re-uses the core string type but currently doesn't provide a predicate to check whether a string is actually valid for use with its API.

I hope that clarifies my point :-) So again: Would you be interested in integrating such a validation predicate with the utf8 egg? I think it would belong there but I can also make it a separate egg if you prefer.

comment:9 follow-up: 10 Changed 10 years ago by Alex Shinn

Yes, a validation predicate is a long-standing todo.
I met get around to it soon, patches are also welcome.

It would in theory be possible to validate every input
to every utf8 operation, but I have no intention of doing
so, for performance reasons and because people may
currently be using invalid utf8 in "safe" ways already.

comment:10 in reply to: 9 Changed 10 years ago by Moritz Heidkamp

Resolution:	invalid
Status:	closed → reopened

Replying to ashinn:

Yes, a validation predicate is a long-standing todo.
I met get around to it soon, patches are also welcome.

I'm attaching a patch which adds utf8-validation module along with some rudimentary sanity tests. It only exports the discussed predicate, named utf8-string? -- that's the reason I put it in a separate module, since I couldn't think of a better name and I didn't want to make the main utf8 module un-prefixable. Perhaps you have a better idea?

The validation algorithm is based on The Unicode Standard, Version 7.0 - Core Specification, Table 3-7, p. 125. It performs reasonably well when compiled with -O2 -specialize or -O3 (around an order of magnitude slower than an implementation of the same algorithm in C). I provide it to you under the same license as the utf8 egg so feel free to include it.

It would in theory be possible to validate every input
to every utf8 operation, but I have no intention of doing
so, for performance reasons and because people may
currently be using invalid utf8 in "safe" ways already.

Yep, I totally agree with that!

Changed 10 years ago by Moritz Heidkamp

Attachment:	utf8-validation.diff added

comment:11 Changed 10 years ago by Alex Shinn

Type:	defect → enhancement

The code looks good. How about keeping it in utf8 and just calling it valid-string?

comment:12 Changed 10 years ago by Moritz Heidkamp

Ah yes, very good idea! I'm attaching a new patch which puts the implementation in utf8-lolevel and reexports it as valid-string? from utf8. I've also changed the implementation to re-use the existing utf8-start-byte->length. What do you think?

Changed 10 years ago by Moritz Heidkamp

Attachment:	utf8-validation-2.diff added

comment:13 Changed 10 years ago by Alex Shinn

Resolution:	→ fixed
Status:	reopened → closed

Looks good. I simplified a little and added documentation.
Should be fixed in version 3.4.0.

Note: See TracTickets for help on using tickets.

Download in other formats: