Opened 11 years ago

Closed 11 years ago

#1019 closed defect (wontfix)

Different uri-common results for a hairy case

Reported by: Mario Domenech Goulart
Owned by: sjamaan
Priority: not urgent at all
Milestone: someday
Component: extensions
Version: 4.8.x
Keywords: uri-common
Cc:
Estimated difficulty:

Description

http://blog.lunatech.com/2009/02/03/what-every-web-developer-must-know-about-url-encoding shows an example of a crazy URI and its corresponding parts after parsing. uri-common's uri-reference seems to parse it differently, producing different results.

Here's the case from the aforementioned blog post:

While this is slightly nuts and 
"http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;==#/?:@-._~!$&'()*+,;="
is a valid HTTP URL, this is the standard.

For the curious, the previous URL expands to:

Part	Value
Scheme	http
Host	example.com
Path	/:@-._~!$&'()*+,=
Path parameter name	:@-._~!$&'()*+,
Path parameter value	:@-._~!$&'()*+,==
Query parameter name	/?:@-._~!$'()* ,;
Query parameter value	/?:@-._~!$'()* ,;==
Fragment	/?:@-._~!$&'()*+,;=

Here's what uri-common produces for that uri:

(uri-reference "http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@-")

#<URI-common: scheme=http port=#f host="example.com" path=(/ ":@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==") query=((|/?:@-._~!$'()* ,| . #t) (|| . "/?:@-")) fragment=#f>

I'm not sure it is a bug in uri-common. Just filing this ticket because I noticed the difference between uri-common's behavior and the results presented by that blog post.

Change History (7)

comment:1 Changed 11 years ago by sjamaan

Resolution: invalid
Status: new → closed

According to RFC 3986, a path cannot contain a question mark (which this blog post seems to suggest is allowed within the value part of the path parameter).

Quoting the relevant parts of the BNF:

URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

As you can see, the question mark occurs in gen-delims, which itself is a production of reserved. However, neither of these two productions can be derived through any of the path productions.
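The same point can be checked mechanically. The following is a minimal Python sketch, not part of uri-common: it transcribes the character classes from the BNF above into sets (the names `UNRESERVED`, `SUB_DELIMS`, `GEN_DELIMS` and `PCHAR` are chosen for illustration) and leaves out `pct-encoded`, since that is a three-character sequence rather than a single character and the example contains no percent signs.

```python
import string

# Character classes transcribed from the RFC 3986 BNF quoted above.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")
GEN_DELIMS = set(":/?#[]@")

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
# (pct-encoded is omitted here; it is a sequence, not a single character)
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")

# The path of the example URI, up to the first "?".
PATH = "/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="

# Every character of every "/"-separated segment must be a pchar.
segments_ok = all(ch in PCHAR for seg in PATH.split("/") for ch in seg)

print("'?' allowed in a segment:", "?" in PCHAR)
print("'?' is a gen-delim:", "?" in GEN_DELIMS)
print("example path uses only pchar:", segments_ok)
```

Under these assumptions the question mark is a gen-delim but not a pchar, while every character in the example's path is a pchar.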

As I understand it, that means that the generalised URI syntax doesn't allow unencoded question marks in paths, which means the example is incorrectly split up. Section 3.3 ("Path") even explicitly says:

The path is terminated by the first question mark ("?") or number sign ("#") character, or by the end of the URI.
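To see where that rule cuts the example URI, here is a small Python sketch that applies the component-splitting regular expression given in RFC 3986, Appendix B. It is an illustration only and not how uri-common is implemented; `split_uri` and `URI_RE` are names made up for this example.

```python
import re

# Regular expression from RFC 3986, Appendix B ("Parsing a URI Reference
# with a Regular Expression"). Groups 2, 4, 5, 7 and 9 are the scheme,
# authority, path, query and fragment.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

def split_uri(uri):
    """Split a URI reference into its five generic components."""
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }

# The full URI from the blog post, broken at the "?" and "#" for readability.
URI = ("http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="
       "?/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;=="
       "#/?:@-._~!$&'()*+,;=")

parts = split_uri(URI)
for name, value in parts.items():
    print(f"{name:10} {value}")
```

The path group stops at the first "?", and the later question marks end up inside the query and the fragment, where the generic syntax permits them.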

Perhaps question marks were allowed in URLs by one of the older specs?

Also, there seems to be no formal spec that defines the "path matrix" stuff, only some notes from the W3C with the status "personal worldview", and a mention that this isn't implemented and may never be implemented in practice ("this is not a feature of the web, but it could have been"). Anyway, this is a separate issue from the syntax of what's allowed in a path string. The generic URI syntax just defines the path as a hierarchical sequence of opaque strings, which it also does for queries.

The uri-common egg *might* do something with this, like it does with the query. On the one hand this would make it slightly more powerful (as in the Yahoo RESTful API), but I'm not sure the added annoyance of working with paths would be worth it. So few libraries even support this separation that Yahoo was forced to allow query arguments as an alternative, if I understand the docs correctly.

comment:2 Changed 11 years ago by sjamaan

For completeness, here's the way uri-common parses the URI in a slightly more human-readable form, which I think is correct:

full uri      http://example.com/:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==?/?:@-._~!$'()*+,;=/?:@-

scheme         http
port           80   (implicitly derived via scheme)
host           example.com
path           /:@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,==
query name 1   /?:@-._~!$'()* ,
query value 1  (empty: the key is immediately followed by ;)
query name 2   (empty: the preceding ; is immediately followed by an = sign)
query value 2  /?:@-._~!$'()* ,
query name 3   (either empty or "=")
query value 3  (either empty or "="; this is mutually exclusive with the name due to the ";==" construct)
fragment       /?:@-._~!$&'()*+,;=

This is when you accept the semicolon as a separator. Otherwise the split would be more similar to what's quoted in the blog post.
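The difference the separator makes can be sketched in a few lines of Python. This is not uri-common's code; `split_query` is a hypothetical helper that splits on a configurable set of separators, splits each part at its first "=", and decodes "+" as a space. A parameter with no "=" gets the value `None` here, where uri-common's output in the description shows `#t`.

```python
from urllib.parse import unquote_plus

def split_query(query, separators=";&"):
    """Split a query string into (name, value) pairs.

    Each part is split at its first "="; a part without "=" gets None
    as its value. "+" is decoded as a space (form-encoding convention).
    """
    parts = [query]
    for sep in separators:
        parts = [piece for part in parts for piece in part.split(sep)]
    pairs = []
    for part in parts:
        name, eq, value = part.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value) if eq else None))
    return pairs

# The query component of the example URI (between "?" and "#").
QUERY = "/?:@-._~!$'()*+,;=/?:@-._~!$'()*+,;=="

with_semicolon = split_query(QUERY, separators=";&")
ampersand_only = split_query(QUERY, separators="&")

print(with_semicolon)
print(ampersand_only)
```

With ";" accepted as a separator the sketch yields three parameters, and with only "&" it yields the single name/value pair listed in the blog post's table. For the trailing ";==" it assigns the "=" to the value, which is one of the two readings mentioned above.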

It could be argued whether the path should be further split up; this depends on whether the library supports the nonstandard matrix path components. If it does, I think this would be split up trivially like so:

path                   /:@-._~!$&'()*+,=
path parameter name    :@-._~!$&'()*+,
path parameter value   :@-._~!$&'()*+,==

That's identical to what the blog post said, except for the trailing bit of the path parameter value that got chopped off due to the query string parsing.
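As a rough illustration of that "trivial" split, the Python sketch below treats ";" as the start of a parameter list within one path segment and splits each parameter at its first "=". Matrix parameters are nonstandard, so the splitting rule itself is an assumption, and `split_matrix_segment` is a hypothetical helper, not something uri-common provides.

```python
def split_matrix_segment(segment):
    """Split one path segment into (name, [(param_name, param_value), ...]).

    Assumed convention: parameters follow the segment name, separated
    by ";", and each parameter is split at its first "=".
    """
    name, *raw_params = segment.split(";")
    params = []
    for raw in raw_params:
        key, eq, value = raw.partition("=")
        params.append((key, value if eq else None))
    return name, params

# The single path segment of the example URI (leading "/" removed).
SEGMENT = ":@-._~!$&'()*+,=;:@-._~!$&'()*+,=:@-._~!$&'()*+,=="

segment_name, segment_params = split_matrix_segment(SEGMENT)
print("path                 /" + segment_name)
for key, value in segment_params:
    print("path parameter name ", key)
    print("path parameter value", value)
```

Applied to the example, this gives one parameter whose name and value match the three lines above.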

comment:3 Changed 11 years ago by sjamaan

Ignore my mutterings about the question mark in the path string; there was none in the blog post. I don't know why, but I must have misread it.

comment:4 Changed 11 years ago by sjamaan

Resolution: invalid
Status: closed → reopened

comment:5 Changed 11 years ago by sjamaan

Status: reopened → closed

OK, different resolution status then...

comment:6 Changed 11 years ago by sjamaan

Status: closed → reopened

comment:7 Changed 11 years ago by sjamaan

Resolution: wontfix
Status: reopened → closed

I don't know what's wrong today, nothing works for me!