Opened 2 years ago
Last modified 2 years ago
#1805 assigned defect
`html->sxml` with escaped quotes breaks text into multiple nodes
Reported by: | Jeremy Steward | Owned by: | Alex Shinn |
---|---|---|---|
Priority: | minor | Milestone: | someday |
Component: | extensions | Version: | 5.3.0 |
Keywords: | | Cc: | |
Estimated difficulty: | | | |
Description
There's some weirdness with escaping quotes in text when using `html->sxml`. Perhaps a short example would be sufficient to explain the problem I'm encountering:
(html->sxml "<p>foo&apos;bar&quot;baz</p>") ;=> (*TOP* (p "foo" "'" "bar" "\"" "baz"))
As a counter-example, I'll use the ssax egg:
(call-with-input-string "<p>foo&apos;bar&quot;baz</p>" (lambda (port) (ssax:xml->sxml port '()))) ;=> (*TOP* (p "foo'bar\"baz"))
I guess fundamentally it's a question of whether there should be one text node or not. I would argue that in this particular case, it should be a single node. I have been using html-parser to try and scrape some web pages, and this is extremely unexpected! Especially so if one uses `txpath` / `sxpath` on the final result, as `//p/text()` queries will not necessarily behave as expected. You would have to call `(apply string-append ((txpath "//p/text()") sxml))` on the result to get the full contents of the text.
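As a stopgap on the caller's side, adjacent strings could be merged after parsing, so that each run of text becomes a single node again. The following is only a minimal sketch, assuming the SXML tree is made of proper lists; `coalesce-text` is a hypothetical helper written for this ticket, not part of html-parser, ssax, or sxpath:

```scheme
;; Hypothetical helper (not provided by any egg): walk an SXML tree and
;; merge every run of adjacent strings into one string.
;; Assumes every list in the tree is a proper list.
(define (coalesce-text node)
  (if (not (pair? node))
      node                              ; symbols, strings, etc. pass through
      (let loop ((items node) (acc '()))
        (cond
          ((null? items) (reverse acc))
          ;; Current item and the previously kept item are both strings:
          ;; append them instead of keeping two separate text nodes.
          ((and (string? (car items))
                (pair? acc)
                (string? (car acc)))
           (loop (cdr items)
                 (cons (string-append (car acc) (car items)) (cdr acc))))
          ;; Anything else is kept, recursing into child elements.
          (else
           (loop (cdr items)
                 (cons (coalesce-text (car items)) acc)))))))

(coalesce-text '(*TOP* (p "foo" "'" "bar" "\"" "baz")))
;=> (*TOP* (p "foo'bar\"baz"))
```

Running the parser output through such a pass before querying would let `//p/text()` return one string per run of text, at the cost of an extra traversal of the tree.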
Is there a rationale for this, or is that some kind of limitation of the parser? I know that tags may also contain sub-tags in HTML, but I'm not sure a new node should be made if a tag's contents are not HTML tags themselves.
Fixed SSAX example: