Opened 12 years ago
Closed 12 years ago
#1019 closed defect (wontfix)
Different uri-common results for a hairy case
Reported by: | Mario Domenech Goulart | Owned by: | sjamaan |
---|---|---|---|
Priority: | not urgent at all | Milestone: | someday |
Component: | extensions | Version: | 4.8.x |
Keywords: | uri-common | Cc: | |
Estimated difficulty: |
Description
http://blog.lunatech.com/2009/02/03/what-every-web-developer-must-know-about-url-encoding shows an example of a crazy URI and its corresponding parts after parsing. uri-common's uri-reference seems to parse it differently, producing different results.
Here's the case from the aforementioned blog post:
While this is slightly nuts and "http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;==#/?:@-._~!$&'()*+,;=" is a valid HTTP URL, this is the standard. For the curious, the previous URL expands to:
Part | Value |
---|---|
Scheme | http |
Host | example.com |
Path | /:@-._~!$&'()*+,= |
Path parameter name | :@-._~!$&'()*+, |
Path parameter value | :@-._~!$&'()*+,== |
Query parameter name | /?:@-._~!$'()* ,; |
Query parameter value | /?:@-._~!$'()* ,;== |
Fragment | /?:@-._~!$&'()*+,;= |
Here's what uri-common produces for that uri:
(uri-reference "http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@-")
#<URI-common: scheme=http port=#f host="example.com" path=(/ ":@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==") query=((|/?:@-._~!$'()* ,| . #t) (|| . "/?:@-")) fragment=#f>
I'm not sure it is a bug in uri-common. Just filing this ticket because I noticed the difference between uri-common's behavior and the results presented by that blog post.
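For comparison, the generic five-way split can be reproduced with the regular expression given in RFC 3986, Appendix B. The following is a minimal Python sketch for illustration only, not uri-common's implementation; the names `URI_RE` and `split_uri` are made up here.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


# The full URI from the blog post, split over three literals for readability.
uri = ("http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="
       "?/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;=="
       "#/?:@-._~!$&'()*+,;=")

parts = split_uri(uri)
for name, value in parts.items():
    print(name, value)
```

At this level the path runs up to the first "?" and the query up to the first "#"; how the path and query are subdivided further is left to the scheme or the library.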
Change History (7)
comment:1 Changed 12 years ago by
Resolution: | → invalid |
---|---|
Status: | new → closed |
comment:2 Changed 12 years ago by
For completeness, here's the way uri-common parses the URI in a slightly more human-readable form, which I think is correct:
Part | Value |
---|---|
full uri | http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@- |
scheme | http |
port | 80 (implicitly derived via scheme) |
host | example.com |
path | /:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,== |
query name 1 | /?:@-._~!$'()* , |
query value 1 | (empty: the key is immediately followed by =;) |
query name 2 | (empty: the preceding ; is immediately followed by an = sign) |
query value 2 | /?:@-._~!$'()* , |
query name 3 | (either empty or "=") |
query value 3 | (either empty or "="; this is mutually exclusive with the name due to the ";==" construct) |
fragment | /?:@-._~!$&'()*+,;= |
This is when you accept the semicolon as a separator. Otherwise the split would be more similar to what's quoted in the blog post.
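The effect of the separator choice can be shown with a small Python sketch. It is only an illustration under simplifying assumptions: `split_query` is a hypothetical helper, percent-decoding is left out, and "+" is decoded to a space as in form encoding. It is not how uri-common is implemented.

```python
def split_query(query, separators=";&"):
    """Split a query string into (name, value) pairs.

    value is None when the pair contains no "=" at all.
    Percent-decoding is deliberately omitted in this sketch.
    """
    parts = [query]
    for sep in separators:
        parts = [piece for part in parts for piece in part.split(sep)]

    pairs = []
    for part in parts:
        if not part:
            continue  # skip empty pieces such as those from ";;"
        name, eq, value = part.partition("=")  # split on the first "=" only
        name = name.replace("+", " ")
        value = value.replace("+", " ") if eq else None
        pairs.append((name, value))
    return pairs


query = "/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;=="

with_semicolon = split_query(query)          # ";" and "&" both separate pairs
without_semicolon = split_query(query, "&")  # only "&" separates pairs

print(with_semicolon)
print(without_semicolon)
```

With ";" accepted as a separator the sketch yields three pairs, the last one having an empty name and "=" as its value. With only "&" as separator it yields the single name/value pair shown in the blog post.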
It could be argued whether the path should be further split up; this depends on whether the library supports the nonstandard matrix path components. If it does, I think this would be split up trivially like so:
Part | Value |
---|---|
path | /:@-._~!$&'()*+,= |
path parameter name | :@-._~!$&'()*+, |
path parameter value | :@-._~!$&'()*+,== |
That's identical to what the blog post said, except for the trailing bit of the path parameter value that got chopped off due to the query string parsing.
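A library that wanted to support matrix-style path parameters could split each segment roughly as follows. This is a hedged Python sketch; `split_matrix_segment` is a hypothetical helper, and since there is no formal spec for this convention, the splitting rules (";" between parameters, first "=" between name and value) are assumptions.

```python
def split_matrix_segment(segment):
    """Split one path segment into (name, [(param_name, param_value), ...]).

    param_value is None when the parameter has no "=".
    """
    name, *params = segment.split(";")
    pairs = []
    for param in params:
        key, eq, value = param.partition("=")  # split on the first "=" only
        pairs.append((key, value if eq else None))
    return name, pairs


path = "/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="

# Drop the empty string before the leading "/" and split every segment.
segments = [split_matrix_segment(s) for s in path.split("/")[1:]]
print(segments)
```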
comment:3 Changed 12 years ago by
Ignore my mutterings about the question mark in the path string; there was none in the blog post. I don't know why, but I must have misread it.
comment:4 Changed 12 years ago by
Resolution: | invalid |
---|---|
Status: | closed → reopened |
comment:6 Changed 12 years ago by
Status: | closed → reopened |
---|
comment:7 Changed 12 years ago by
Resolution: | → wontfix |
---|---|
Status: | reopened → closed |
I don't know what's wrong today, nothing works for me!
According to RFC 3986, a path cannot contain a question mark (which this blog post seems to suggest is allowed within the value part of the path parameter).
Quoting the relevant parts of the BNF:
As you can see, the question mark occurs in gen-delims, which itself is a production of reserved. However, neither of these two productions can be derived through any of the path productions.
As I understand it, that means that the generalised URI syntax doesn't allow unencoded question marks in paths, which means the example is incorrectly split up. Section 3.3 ("Path") even explicitly says:
Perhaps question marks were allowed in URLs by one of the older specs?
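The grammar argument can be checked mechanically. In RFC 3986, a path segment is a sequence of pchar, where pchar = unreserved / pct-encoded / sub-delims / ":" / "@", and "?" appears only in gen-delims. The Python sketch below encodes the path-abempty production (the one that applies when an authority is present) as a regular expression; `PCHAR`, `PATH_ABEMPTY` and `is_valid_path` are names chosen here for illustration.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"

# path-abempty = *( "/" segment ), segment = *pchar
PATH_ABEMPTY = re.compile(r"(?:/" + PCHAR + r"*)*")


def is_valid_path(path):
    """Return True if path matches the path-abempty production."""
    return PATH_ABEMPTY.fullmatch(path) is not None


blog_path = "/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="

print(is_valid_path(blog_path))  # the path from the blog post's URI
print(is_valid_path("/a?b"))     # unencoded question mark
print(is_valid_path("/a%3Fb"))   # percent-encoded question mark
```

The path from the blog post is accepted because it contains no "?", whereas a path with an unencoded question mark is rejected and only the percent-encoded form passes.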
Also, there seems to be no formal spec that defines the "path matrix" stuff, only some notes from the W3C with the status "personal worldview", and a mention that this isn't implemented and may never be implemented in practice ("this is not a feature of the web, but it could have been"). Anyway, this is a separate issue from the syntax of what's allowed in a path string. The generic URI syntax just defines the path as a hierarchical sequence of opaque strings, which it also does for queries.
The uri-common egg *might* do something with this, like it does with the query. On the one hand this would make it slightly more powerful (as in the Yahoo RESTful API), but I'm not sure the added annoyance of working with paths would be worth it. So few libraries even support this separation that Yahoo was forced to allow query arguments as an alternative, if I understand the docs correctly.