Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#998 closed defect (invalid)

uri->string / make-uri path encoding inconsistencies

Reported by: andyjpb
Owned by: sjamaan
Priority: major
Milestone: someday
Component: unknown
Version: 4.8.x
Keywords:
Cc:
Estimated difficulty:

Description (last modified by sjamaan)

(use uri-common)

#;11> (uri->string (uri-reference "./5:123"))
"./5:123"

Correct!

#;12> (uri->string (make-uri path: '("5:123")))
"5%3A123"

Incorrect!

#;13> (uri->string (make-uri path: '("." "5:123")))
"./5%3A123"

Incorrect!

make-uri appears to have its own path encoder, which encodes an overly broad set of characters. It also seems to lack the logic for prepending "./" to relative paths whose first segment contains a colon.

Change History (7)

comment:1 Changed 11 years ago by sjamaan

Description: modified (diff)
Owner: set to sjamaan
Status: new → accepted

comment:2 Changed 11 years ago by sjamaan

I'm not certain, but this appears to be correct. That a parsed string is read/write invariant is a deliberate feature, so that non-HTTP URIs keep their exact encoding; this makes it easier for applications to extract the original "generic" URI from the object in unmodified form.

When generating a URI (or updating a component), these characters get encoded. It is *extremely* unclear from the spec what should happen in this case.

According to RFC 3986 (URI), section 2.2:
"URI producing applications should percent-encode data octets that
correspond to characters in the reserved set unless these characters
are specifically allowed by the URI scheme to represent data in that
component."

and section 2.3:
"URIs that differ in the replacement of an unreserved character with
its corresponding percent-encoded US-ASCII octet are equivalent:
they identify the same resource."

Coupled with RFC 2616 (HTTP/1.1), section 3.2.3:
"Characters other than those in the "reserved" and "unsafe" sets (see
RFC 2396 [42]) are equivalent to their ""%" HEX HEX" encoding."

Leaving aside that "unsafe" is not even defined in RFC 2396 (the predecessor of RFC 3986), I interpret this to mean that special characters are to be treated as special, and that implementations should be as conservative as possible and percent-encode all these other characters. This means that "./5:123", "5%3A123" and "./5%3A123" are all distinct URIs which should be differentiated on the server side. There's no sane choice to be made except to just encode everything that isn't 100% safe.
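To make the difference between these spellings concrete, here is a rough sketch using Python's standard urllib.parse. It is not uri-common or uri-generic code; it only stands in for a conservative and a more permissive path-segment encoder. The helper names encode_conservative and encode_permissive are made up for this illustration.

```python
from urllib.parse import quote

def encode_conservative(segment):
    # Percent-encode everything except the RFC 3986 unreserved
    # characters (letters, digits, "-", ".", "_", "~").
    return quote(segment, safe="")

def encode_permissive(segment):
    # Hypothetical alternative: additionally leave ":" unencoded.
    return quote(segment, safe=":")

print(encode_conservative("5:123"))  # 5%3A123
print(encode_permissive("5:123"))    # 5:123
```

The conservative variant produces the same kind of output that make-uri shows in the description; the permissive variant corresponds to what the report expected.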

uri-generic, on the other hand, does _not_ encode anything except the slash, because it explicitly puts more control into the user's hands and lets the user decide which of the three paths ("./5:123", "5%3A123" or "./5%3A123") they want. In that sense, uri-generic is more low-level and therefore allows more fine-grained control.

comment:3 Changed 11 years ago by sjamaan

Long story short: the current behaviour is emphatically not incorrect, as you so boldly stated. The current behaviour could be up for debate, but the issues involved are rather intricate and generally tricky, so I'd rather not change it unless it really causes too much trouble.

The fundamental problem is that uri-common fully decodes all its components (this is stated in the manual), which is a lossy conversion. When the components are put back together, they're kept as-is (to prevent this loss) unless you've supplied them yourself, in which case they are always encoded in the most conservative way.
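Why full decoding is lossy can be seen in a small sketch, again using Python's urllib.parse purely as a stand-in (this is not how uri-common is implemented). Two different spellings collapse to the same decoded value, so the original spelling cannot be recovered from the decoded component alone.

```python
from urllib.parse import unquote

# Two distinct spellings of the same path.
spellings = ["./5:123", "./5%3A123"]

# After full decoding, only one value remains.
decoded = {unquote(s) for s in spellings}
print(decoded)  # {'./5:123'}
```

This is why a library that exposes decoded components has to either remember the original string or pick one encoding when writing the component back out.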

Perhaps this is fundamentally wrong; for example, it's quite possible that some frameworks would accept "foo[]" as a query key named "foo" which must be an array whereas "foo%5b%5d" would be a query key named "foo[]". Strictly speaking, I believe this is allowed by the spec. If this turns out to be a major problem, I might decide to deprecate uri-common altogether unless a clean solution can be cooked up.

comment:4 Changed 11 years ago by sjamaan

Also note that the final "foo[]" vs "foo%5b%5d" example I gave proves that the current behaviour is at least as correct as can be given the circumstances: any random string used as a key in the query alist will be acceptable and interpreted as-is: the string "foo[]" will arrive on the server as "foo[]", rather than as "foo" indicating array notation (which is not what was intended).
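The same point, sketched with Python's urllib.parse as an analogy only (uri-common's query handling is separate code): a key supplied as the plain string "foo[]" is encoded conservatively on the way out and decodes back to exactly "foo[]".

```python
from urllib.parse import urlencode, parse_qs

# The key is an arbitrary string; the brackets are data, not syntax.
query = urlencode({"foo[]": "1"})
print(query)            # foo%5B%5D=1

# Decoding yields the key exactly as it was supplied.
print(parse_qs(query))  # {'foo[]': ['1']}
```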

Of course, you're in trouble if you really *do* intend "foo[]" to end up in the final URI string as-is, but then you probably need to define your own layer over uri-generic which has this additional feature (it still needs to be able to encode the key "foo[]" as well as indicate the key "foo" needs to be an array).

Not fully-decoding everything (and leaving it to the user) isn't an option either, because that convenience kind of is the whole point that uri-common even exists!

comment:5 Changed 11 years ago by sjamaan

Resolution: invalid
Status: accepted → closed

As discussed at length on IRC, this is indeed the designed behaviour. I'll consider whether it's possible and sane to parameterize which special characters to encode and which to leave alone.

comment:6 Changed 11 years ago by andyjpb

Thx sjamaan!

The time you put into explaining this is much appreciated.
I promise to tell everybody about this (#975).

comment:7 Changed 11 years ago by andyjpb

Thx sjamaan!
