Html-parsing omits end tag omission?

The package html-parsing was recommended to me as the most accurate HTML parsing library in Racket. However, it doesn't appear to handle implicit end tags, which are common in a lot of HTML. For example, the p tags below shouldn't be nested, because the closing </p> tag is not required:

> (require html-parsing)
> (with-input-from-string "<html><p>foo<p>bar</html>"
        (lambda () (html->xexp (current-input-port))))
'(*TOP* (html (p "foo" (p "bar"))))

Am I missing something here? Is there a way to handle end tag omission in html-parsing? According to the code, html-parsing does appear to constrain the possible parents of an element, but I don't think that will be much help here.

More info on tag omission can be found in the description of the paragraph element here:

It occurred to me that there could be more to this than my simple example illustrates, since when parsing html from a file I don't always see nested p's. Here is another example from a real file that I parsed:

   (head "\n" (title "Gloom's Armor Guide") "\n")
    (@ (bgcolor "#000000") (text "#c0c0c0") (link "#ffff00") (vlink "#ff8000"))
    (h1 (@ (align "Center")) "GLOOMS GUIDE TO ARMOR\r\n")
    (h2 (@ (align "Center")) "version 1.0 5/4/96\r\n")
     "More than you ever wanted to know about Armor.  Based on information\n"
     "originally provided by Cyper, Trachten, Taluk, and others.  Much of this\n"
     "information was provided during the ice age.  I have done my best to convert\r\n"
     "it to the new system but there may be some mistakes. Please contact me at\n"
     (a (@ (href "")) "")
     " if you have\n"
     "a comment, question or correction.\r\n")
     "Below is a list of the different armor types available in GSIII.Critical\r\n"
     "Range is the roll needed by your opponet to cause a critical hit. Training\r\n"
     "is the amount of armor training you need to avoid RT penalties. Every 20\r\n"
     "points of armor training trains away one second of RT.\r\n"
     (h3 (@ (align "Center")) (b "Armor Groups (AG)   \tCritical Range ") "\r\n"))

The h3 on the last line shouldn't be nested within a paragraph: that isn't valid, and the paragraph should have been closed implicitly. Yet the paragraph above that one also lacks a closing </p> tag, and it is handled correctly.

This looks like a bug to me. You should email Neil to report the problem.

Note: here's a link to the specification of the paragraph tag:

Neil updated html-parsing to fix my issue. Version 8.0 has the fix.

His fix was to enumerate the allowed parent elements for the h1-6 tags, and to add html to the list of allowed parents for p. The end tag omission is, in fact, handled via the parent constraints. When encountering a new element, the parser searches the list of currently open elements to find an allowed parent and closes any intervening elements if necessary to meet the constraints.
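
To make the mechanism concrete, here is a minimal sketch of how parent constraints can close elements implicitly. It is not html-parsing's actual code: the `allowed-parents` table, the token format, and every function name below are assumptions made up for illustration, and the table covers only a few elements.

```racket
#lang racket

;; HYPOTHETICAL constraint table: element name -> allowed parent elements.
;; Elements missing from the table may appear under any parent.
;; This is not the table that html-parsing uses.
(define allowed-parents
  (hash 'p  '(html body div td th li blockquote)
        'h3 '(html body div td th li blockquote)
        'li '(ul ol)))

;; An open element: its name plus its children in reverse order.
(struct frame (name kids) #:transparent)

;; Close the innermost open element and attach it to the one below it.
(define (close-top stack)
  (match stack
    [(list* (frame name kids) (frame pname pkids) rest)
     (cons (frame pname (cons (cons name (reverse kids)) pkids)) rest)]))

;; Add a child (string or finished element) to the innermost open element.
(define (add-child stack child)
  (match stack
    [(cons (frame name kids) rest)
     (cons (frame name (cons child kids)) rest)]))

;; If some open element is an allowed parent of `name`, close the
;; intervening elements until it is innermost. Otherwise change nothing.
(define (close-until-allowed stack name)
  (define allowed (hash-ref allowed-parents name #f))
  (cond
    [(and allowed
          (for/or ([f (in-list stack)]) (memq (frame-name f) allowed)))
     (let loop ([stack stack])
       (if (memq (frame-name (car stack)) allowed)
           stack
           (loop (close-top stack))))]
    [else stack]))

;; Handle an explicit end tag: close up to and including the named
;; element. An end tag with no matching open element is ignored.
(define (close-named stack name)
  (cond
    [(not (for/or ([f (in-list stack)]) (eq? (frame-name f) name))) stack]
    [else
     (let loop ([stack stack])
       (define done? (eq? (frame-name (car stack)) name))
       (define next (close-top stack))
       (if done? next (loop next)))]))

;; Tokens are strings, (open name), or (close name).
(define (tokens->xexp tokens)
  (define final
    (for/fold ([stack (list (frame '*TOP* '()))])
              ([tok (in-list tokens)])
      (match tok
        [(? string? s) (add-child stack s)]
        [(list 'open name)
         (cons (frame name '()) (close-until-allowed stack name))]
        [(list 'close name) (close-named stack name)])))
  ;; Close whatever is still open at the end of the input.
  (let loop ([stack final])
    (if (null? (cdr stack))
        (cons '*TOP* (reverse (frame-kids (car stack))))
        (loop (close-top stack)))))

;; Tokens for "<html><p>foo<p>bar</html>"
(tokens->xexp '((open html) (open p) "foo" (open p) "bar" (close html)))
;; => '(*TOP* (html (p "foo") (p "bar")))
```

In this sketch the second p is not nested because p is absent from its own list of allowed parents, so the open p is closed before the new one is pushed. If html were removed from the list for p, no open element would qualify as a parent, nothing would be closed, and the paragraphs would nest as in my original example.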