Opened 6 months ago

Last modified 6 months ago

#1805 assigned defect

`html->sxml` with escaped quotes breaks text into multiple nodes

Reported by: Jeremy Steward
Owned by: Alex Shinn
Priority: minor
Milestone: someday
Component: extensions
Version: 5.3.0
Keywords:
Cc:
Estimated difficulty:

Description

html->sxml behaves oddly when text contains escaped quotes: each entity is split out into its own text node. A short example should explain the problem I'm encountering:

(html->sxml "<p>foo&apos;bar&quot;baz</p>") ;=> (*TOP* (p "foo" "'" "bar" "\"" "baz"))

As a counter-example, I'll use the ssax egg:

(call-with-input-string "<p>foo&apos;bar&quot;baz</p>" (cut ssax:xml->sxml <> '())) ;=> (*TOP* (p "foo'bar\"baz"))

Fundamentally, the question is whether this should be one text node or several. I would argue that in this particular case it should be a single node. I have been using html-parser to scrape some web pages, and this is extremely unexpected, especially when running txpath / sxpath on the result, since //p/text() queries return several fragments instead of one string. You would have to call (apply string-append ((txpath "//p/text()") sxml)) to get the full contents of the text.
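As a stopgap, the tree could be post-processed so that runs of adjacent strings are merged before any query is run. The following is only a sketch in plain Scheme, assuming the SXML consists of proper lists; merge-adjacent-strings is a hypothetical helper name and is not part of html-parser, ssax, or sxpath:

```scheme
;; Hypothetical helper (not provided by html-parser or sxpath):
;; walk an SXML tree and concatenate runs of adjacent string
;; children into a single string.
(define (merge-adjacent-strings node)
  (if (list? node)
      (let loop ((rest node) (acc '()))
        (cond ((null? rest) (reverse acc))
              ;; Current child and previously kept child are both
              ;; strings: fold them together.
              ((and (string? (car rest))
                    (pair? acc)
                    (string? (car acc)))
               (loop (cdr rest)
                     (cons (string-append (car acc) (car rest))
                           (cdr acc))))
              ;; Otherwise keep the child, recursing into sub-lists.
              (else
               (loop (cdr rest)
                     (cons (merge-adjacent-strings (car rest)) acc)))))
      ;; Symbols, strings, and other atoms are returned unchanged.
      node))
```

Applied to the html->sxml result above, (merge-adjacent-strings '(*TOP* (p "foo" "'" "bar" "\"" "baz"))) evaluates to (*TOP* (p "foo'bar\"baz")), which matches the ssax output. This only hides the symptom on the caller's side; it does not answer whether the parser should emit a single node in the first place.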

Is there a rationale for this, or is it a limitation of the parser? I know that HTML elements may contain child elements, but I don't think a new text node should be started when the content in question is not itself a tag.

Change History (1)

comment:1 Changed 6 months ago by Jeremy Steward

Fixed SSAX example:

(call-with-input-string "<p>foo&apos;bar&quot;baz</p>" (cut ssax:xml->sxml <> '())) ;=> (*TOP* (p "foo'bar\"baz"))