Does html-parsing handle end tag omission?

The package html-parsing was recommended to me as the most accurate HTML parsing library for Racket. However, it doesn't appear to handle implicit end tags, which are common in a lot of HTML. For example, the p tags below shouldn't be nested, because the closing </p> tag is not required:

> (with-input-from-string "<html><p>foo<p>bar</html>"
        (lambda () (html->xexp (current-input-port))))
'(*TOP* (html (p "foo" (p "bar"))))

Am I missing something here? Is there a way to handle end tag omission in html-parsing? According to the code, html-parsing does appear to constrain the possible parents of an element, but I don't think that will be much help here.

More info on tag omission can be found in the description of the paragraph element here:

EDIT:
It occurred to me that there could be more to this than my simple example illustrates, since when parsing html from a file I don't always see nested p's. Here is another example from a real file that I parsed:

'(*TOP*
  (html
   (head "\n" (title "Gloom's Armor Guide") "\n")
   (body
    (@ (bgcolor "#000000") (text "#c0c0c0") (link "#ffff00") (vlink "#ff8000"))
    (h1 (@ (align "Center")) "GLOOMS GUIDE TO ARMOR\r\n")
    (h2 (@ (align "Center")) "version 1.0 5/4/96\r\n")
    (p
     (hr)
     "More than you ever wanted to know about Armor.  Based on information\n"
     "\n"
     "originally provided by Cyper, Trachten, Taluk, and others.  Much of this\n"
     "\n"
     "information was provided during the ice age.  I have done my best to convert\r\n"
     "it to the new system but there may be some mistakes. Please contact me at\n"
     "\n"
     (a (@ (href "mailto:DJLV66A@prodigy.com")) "DJLV66A@prodigy.com")
     " if you have\n"
     "a comment, question or correction.\r\n")
    (p
     "Below is a list of the different armor types available in GSIII.Critical\r\n"
     "Range is the roll needed by your opponet to cause a critical hit. Training\r\n"
     "is the amount of armor training you need to avoid RT penalties. Every 20\r\n"
     "points of armor training trains away one second of RT.\r\n"
     (h3 (@ (align "Center")) (b "Armor Groups (AG)   \tCritical Range ") "\r\n"))

The h3 on the last line shouldn't be nested within a paragraph: that isn't valid HTML, and the paragraph should have been closed implicitly. However, the paragraph above it also lacks a closing </p> tag, yet it is handled correctly.

This looks like a bug to me. You should email Neil to report the problem.

Note: here's a link to the specification of the paragraph tag: https://html.spec.whatwg.org/multipage/grouping-content.html#the-p-element

Neil updated html-parsing to fix my issue. Version 8.0 has the fix.

His fix was to enumerate the allowed parent elements for the h1 through h6 tags, and to add html to the list of allowed parents for p. End tag omission is, in fact, handled via the parent constraints: when the parser encounters a new element, it searches the list of currently open elements for an allowed parent and closes any intervening elements as needed to satisfy the constraints.
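
For anyone who finds this thread later, here is a rough sketch of that mechanism in Racket. It is only a toy model and not html-parsing's actual code: the names open-element, open-all, old-table and new-table, the contents of both tables, and the fallback used when no allowed parent is open are all my own assumptions for illustration. It only tracks the stack of open element names and builds no tree.

```racket
#lang racket

;; Toy model of parent constraints; NOT html-parsing's real implementation.
;; A table maps an element name to the parents it may appear under.
;; Elements without an entry are treated as unconstrained.

;; Hypothetical table resembling the situation before the fix:
;; html is not an allowed parent of p, and h3 has no entry.
(define old-table
  (hash 'p '(body div td)))

;; Hypothetical table after the fix: html added for p, h3 enumerated.
(define new-table
  (hash 'p  '(html body div td)
        'h3 '(html body div td)))

;; open-stack: currently open element names, innermost first.
;; Returns the stack after opening tag.
(define (open-element table open-stack tag)
  (define parents (hash-ref table tag #f))
  (define kept
    (cond
      [(not parents) open-stack] ; unconstrained: nest in place
      [else
       ;; Implicitly close open elements until an allowed parent is innermost.
       (define remaining
         (dropf open-stack (lambda (e) (not (memq e parents)))))
       ;; Assumed fallback: if no allowed parent is open, close nothing.
       (if (null? remaining) open-stack remaining)]))
  (cons tag kept))

;; Feed a sequence of start tags through the model.
(define (open-all table tags)
  (for/fold ([stack '()]) ([tag (in-list tags)])
    (open-element table stack tag)))

(open-all old-table '(html p p))       ; => '(p p html)
(open-all new-table '(html p p))       ; => '(p html)
(open-all old-table '(html body p h3)) ; => '(h3 p body html)
(open-all new-table '(html body p h3)) ; => '(h3 body html)
```

Under these assumed tables the model happens to reproduce what I saw: with the old table, a second p directly under html ends up nested and an h3 ends up inside the open p, while a second p under body closes the first one correctly. With the new table, the open p is closed in each case.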
