Opened 9 years ago

Closed 9 years ago

Last modified 7 years ago

#396 closed defect (fixed)

Handle empty path components in uri-generic/uri-common

Reported by: sjamaan Owned by: sjamaan
Priority: minor Milestone: 4.9.0
Component: extensions Version: 4.6.x
Keywords: uri, pedantic strictness, rfc, standards Cc: raikov
Estimated difficulty:

Description

A quick check of the spec seems to indicate that "http://foo.com/bar//qux" is a valid URI and should be parsed into #(URI scheme=http authority=#(URIAuth host="foo.com" port=#f) path=(/ "bar" "" "qux") query=#f fragment=#f) (ie with an empty string in between foo and bar in the record's "path" slot)

Do some more investigation and add testcases, fixing if the current behaviour of discarding empty components is incorrect.

Change History (14)

comment:1 Changed 9 years ago by sjamaan

comment:2 Changed 9 years ago by Ivan Raikov

What part of the RFC exactly talks about this?

comment:3 Changed 9 years ago by sjamaan

None, that's the problem; it is suspiciously silent about it. See also the analysis of the Apache bugreport.

Because the grammar allows empty paths (and our parser accepts it, which is fine), but the RFC does not mention any normalization or discarding of empty path components, it would be logical to keep empty path components around. It would also be a generalization of what we do with trailing slashes, so it might clean up our code too.

I didn't create this ticket to complain; it was just a reminder for myself. If you want to handle it that's fine, but I intended to work on it later today or tomorrow.

comment:4 in reply to:  3 ; Changed 9 years ago by Ivan Raikov

I don't think you are complaining, I just wondered where this issue is addressed in the RFC, since the ticket description refers to a "quick check of the spec", and there is no test case that illustrates a specific problem. It is perfectly possible that this behavior causes some breakage somewhere, especially if it is undefined in the RFC.

Replying to sjamaan:

None, that's the problem; it is suspiciously silent about it. See also the analysis of the
Apache bugreport.

comment:5 in reply to:  4 ; Changed 9 years ago by sjamaan

Replying to iraikov:

I don't think you are complaining, I just wondered where this issue is addressed in the RFC, since the ticket description refers to a "quick check of the spec", and there is no test case that illustrates a specific problem. It is perfectly possible that this behavior causes some breakage somewhere, especially if it is undefined in the RFC.

Which behaviour do you mean?

I just spent an hour implementing a nasty workaround to encode empty path components for a project at work because Apache discards empty path components. It would be nice if our implementation kept them around. I've also seen bugreports elsewhere that browsers keep them around so a relative reference generated on a server that's unaware of the extra slash breaks on the browser because it does understand the extra slash (and .. would strip off one extra slash).

I think discarding empty components causes more trouble than keeping them. I'm sure keeping them would also cause trouble with some other applications, but only if they generated wrong URLs in the first place. The user of the uri lib can always decide to discard them (Spiffy's file handler could do this, for example so that requests for files in the filesystem would work with any number of slashes, but it could keep around the empty strings in the pathinfo part so web applications can cleanly represent any string there without jumping through hoops like I had to do with Apache)

comment:6 Changed 9 years ago by sjamaan

BTW, Lighttpd keeps them around too

comment:7 in reply to:  5 ; Changed 9 years ago by Ivan Raikov

I meant the discarding-of-empty-path-segments behavior. Some actual test cases that illustrate why this is a problem. Now that I think about it, SMB file sharing URIs are also very insistent about keeping a large number of consecutive slashes, so I would not be surprised if the current uri-generic breaks those. And as I wrote this, I just realized that the uri-generic test cases do not include any SMB URIs, so we need to fix this, regardless of the outcome of this ticket.

It does sound like keeping the empty path segments would be a sane behavior. Backward compatibility should not be a big issue, if any user of the library wanted to write their own normalization procedure that removes the empty segments, they can call update-uri and filter the path accordingly.

Replying to sjamaan:

Which behaviour do you mean?

I just spent an hour implementing a nasty workaround to encode empty path components for a >project at work because Apache discards empty path components. It would be nice if our implementation kept them around.

comment:8 in reply to:  7 Changed 9 years ago by sjamaan

Replying to iraikov:

I meant the discarding-of-empty-path-segments behavior. Some actual test cases that illustrate why this is a problem.

Phew! Glad you agree :)

Now that I think about it, SMB file sharing URIs are also very insistent about keeping a large number of consecutive slashes, so I would not be surprised if the current uri-generic breaks those. And as I wrote this, I just realized that the uri-generic test cases do not include any SMB URIs, so we need to fix this, regardless of the outcome of this ticket.

Could you come up with some particularly nasty examples? I have never really used SMB.
I'm not sure having a lot of real-world examples matters that much, it's mostly the edge cases that are interesting for the tests. If we nail those down, the run-of-the-mill URIs will take care of themselves. But having a few example SMB uris would be good (just put them in the path-cases list. Be sure to update the comment too since these wouldn't be Python cases :P).

It does sound like keeping the empty path segments would be a sane behavior. Backward compatibility should not be a big issue, if any user of the library wanted to write their own normalization procedure that removes the empty segments, they can call update-uri and filter the path accordingly.

Agreed!

comment:9 Changed 9 years ago by sjamaan

Resolution: fixed
Status: newclosed

Oh, fun: I had actually caused this bug myself! (I just didn't remember)

The changeset was [12721], but it doesn't have any associated tests or rationale why I did this, so I'll assume it was a brainfart.

I've now committed a fix for this bug and added testcases for it in [20462].

I'm closing the ticket, but if you want to still add those real-world SMB testcases, feel free to do so.

comment:10 in reply to:  9 Changed 9 years ago by Ivan Raikov

Great, thanks for the testcases, so in the future we can refer to them after we have completely forgotten about this issue and vehemently argue about it again ;-) As for SMB, there is no official document that is specifically treating SMB URIs and their relationship to POSIX and UNC paths, but there is a candidate RFC, which I will look at more closely this weekend, and add an relevant examples from there as testcases for uri-generic.

Replying to sjamaan:

Oh, fun: I had actually caused this bug myself! (I just didn't remember)

The changeset was [12721], but it doesn't have any associated tests or rationale why I did this, so I'll assume it was a brainfart.

I've now committed a fix for this bug and added testcases for it in [20462].

I'm closing the ticket, but if you want to still add those real-world SMB testcases, feel free to do so.

comment:11 Changed 9 years ago by sjamaan

Resolution: fixed
Status: closedreopened

Reopening, as there's still at least one instance of this bug left:

(uri->string (uri-relative-to
	       (uri-reference "../blah")
	       (uri-reference "http://foo.com/bar/foo///")))
=> "http://foo.com/bar/blah"
;; Should be:
"http://foo.com/bar/foo//blah"

We have a few tests that test the current behaviour under extra-cases. There's no mention of a standard or rationale so I guess these were just added because it seemed like a good idea at the time ;)

comment:12 Changed 9 years ago by sjamaan

Resolution: fixed
Status: reopenedclosed

This was fixed and released as version 2.35 last week.

comment:13 Changed 9 years ago by felix winkelmann

Milestone: 4.7.04.8.0

Milestone 4.7.0 deleted

comment:14 Changed 7 years ago by felix winkelmann

Milestone: 4.8.04.9.0

Milestone 4.8.0 deleted

Note: See TracTickets for help on using tickets.