The package html-parsing was recommended to me as the most accurate HTML parsing library in Racket. However, it doesn't appear to handle implicit end tags, which are common in a lot of HTML. For example, the p elements below shouldn't be nested, because the closing </p> tag is not required:
> (require html-parsing)
> (with-input-from-string "<html><p>foo<p>bar</html>"
    (lambda () (html->xexp (current-input-port))))
'(*TOP* (html (p "foo" (p "bar"))))
Am I missing something here? Is there a way to handle end tag omission in html-parsing? According to the code, html-parsing does appear to constrain the possible parents of an element, but I don't think that will be much help here.
More info on tag omission can be found in the description of the paragraph element here:
EDIT:
It occurred to me that there could be more to this than my simple example illustrates, since when parsing HTML from a file I don't always see nested p elements. Here is another example, taken from the output for a real file that I parsed:
'(*TOP*
  (html
   (head "\n" (title "Gloom's Armor Guide") "\n")
   (body
    (@ (bgcolor "#000000") (text "#c0c0c0") (link "#ffff00") (vlink "#ff8000"))
    (h1 (@ (align "Center")) "GLOOMS GUIDE TO ARMOR\r\n")
    (h2 (@ (align "Center")) "version 1.0 5/4/96\r\n")
    (p
     (hr)
     "More than you ever wanted to know about Armor. Based on information\n"
     "\n"
     "originally provided by Cyper, Trachten, Taluk, and others. Much of this\n"
     "\n"
     "information was provided during the ice age. I have done my best to convert\r\n"
     "it to the new system but there may be some mistakes. Please contact me at\n"
     "\n"
     (a (@ (href "mailto:DJLV66A@prodigy.com")) "DJLV66A@prodigy.com")
     " if you have\n"
     "a comment, question or correction.\r\n")
    (p
     "Below is a list of the different armor types available in GSIII.Critical\r\n"
     "Range is the roll needed by your opponet to cause a critical hit. Training\r\n"
     "is the amount of armor training you need to avoid RT penalties. Every 20\r\n"
     "points of armor training trains away one second of RT.\r\n"
     (h3 (@ (align "Center")) (b "Armor Groups (AG) \tCritical Range ") "\r\n"))
The h3 on the last line shouldn't be nested within a paragraph: that isn't valid HTML, and the paragraph should have been closed implicitly. Yet the paragraph above that one also lacks a closing </p> tag, and it is handled correctly.
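For reference, the kind of workaround I would like to avoid is a post-processing pass over the xexp. The following is only a rough sketch of that idea, not anything html-parsing provides: the names hoist, fix-paragraphs and p-closers are my own, and the list of tags that end an open paragraph is an assumed, deliberately incomplete subset of what the HTML spec lists.

```racket
#lang racket/base
(require racket/list)

;; ASSUMPTION: an illustrative, incomplete subset of the tags that
;; implicitly end an open <p>. The HTML spec lists many more.
(define p-closers '(p h1 h2 h3 h4 h5 h6))

;; An element is a list headed by a tag symbol.
;; Attribute lists such as (@ (align "Center")) are not elements.
(define (element? x)
  (and (pair? x) (symbol? (car x)) (not (eq? (car x) '@))))

(define (closes-p? x)
  (and (element? x) (memq (car x) p-closers) #t))

;; hoist : node -> (listof node)
;; Returns the nodes that should replace `node` in its parent. A p
;; element is cut at its first child that would have closed it; that
;; child and everything after it become siblings of the p.
(define (hoist node)
  (cond
    [(not (element? node)) (list node)]
    [else
     (define kids (append-map hoist (cdr node)))
     (if (eq? (car node) 'p)
         (let-values ([(inside outside)
                       (splitf-at kids (lambda (k) (not (closes-p? k))))])
           (cons (cons 'p inside) outside))
         (list (cons (car node) kids)))]))

;; Hypothetical entry point. Assumes the root (normally *TOP*) is not
;; itself a p element, so hoist returns exactly one node for it.
(define (fix-paragraphs xexp)
  (car (hoist xexp)))

(fix-paragraphs '(*TOP* (html (p "foo" (p "bar")))))
;; => '(*TOP* (html (p "foo") (p "bar")))
```

This would un-nest both the p in my first example and the h3 in the second, but it is guesswork after the fact, which is why I would prefer the parser to handle end tag omission itself.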