Tika / TIKA-457

HTMLParser gets an early </body> event

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels:
      None

      Description

      I am using the IdentityMapper in the HTMLparser with this simple document:

      <html><head><title> my title </title>
      </head>
      <body>
      <frameset rows=\"20,*\"> 
      <frame src=\"top.html\">
      </frame>
      <frameset cols=\"20,*\">
      <frame src=\"left.html\">
      </frame>
      <frame src=\"invalid.html\"/>
      </frame>
      <frame src=\"right.html\">
      </frame>
      </frameset>
      </frameset>
      </body></html>
      

      Strangely, the HTMLHandler receives an endElement call for body BEFORE we reach frameset. As a result, the variable bodylevel is decremented back to 0, and the remaining elements are ignored because of the logic implemented in HTMLHandler.

      Any idea?
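
      The effect described above can be reproduced without Tika. The following is a minimal sketch, assuming a handler that only keeps elements seen while a body-depth counter is positive. The class BodyLevelSketch and its replay method are hypothetical names made up for illustration; this is not Tika's actual handler code.

```java
import java.util.ArrayList;
import java.util.List;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for the body-depth logic described in the report;
// this is NOT Tika's actual handler class.
class BodyLevelSketch extends DefaultHandler {
    private int bodyLevel = 0;
    final List<String> kept = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("body".equals(local)) {
            bodyLevel++;
        } else if (bodyLevel > 0) {
            kept.add(local); // only elements inside <body> survive
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("body".equals(local)) {
            bodyLevel--;
        }
    }

    // Replays a frameset with one frame, with </body> arriving either
    // before the frameset (the reported order) or after it.
    static List<String> replay(boolean bodyClosedEarly) {
        BodyLevelSketch h = new BodyLevelSketch();
        Attributes none = new AttributesImpl();
        h.startElement("", "body", "body", none);
        if (bodyClosedEarly) {
            h.endElement("", "body", "body");
        }
        h.startElement("", "frameset", "frameset", none);
        h.startElement("", "frame", "frame", none);
        h.endElement("", "frame", "frame");
        h.endElement("", "frameset", "frameset");
        if (!bodyClosedEarly) {
            h.endElement("", "body", "body");
        }
        return h.kept;
    }
}
```

      With the early close, replay(true) keeps nothing; with the close arriving last, replay(false) keeps frameset and frame.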

      1. TIKA-457.patch
        10 kB
        Ken Krugler

          Activity

          Ken Krugler added a comment -

          It's TagSoup that's generating the "interesting" output. Straight from a TagSoup parser (without Tika), the above gives you:

          <?xml version="1.0" encoding="UTF-8"?>
          <html><head><title> my title </title></head><body/><frameset rows="20,*"><frame frameborder="1" scrolling="auto" src="top.html"/><frameset cols="20,*"><frame frameborder="1" scrolling="auto" src="left.html"/><frame frameborder="1" scrolling="auto" src="invalid.html"/><frame frameborder="1" scrolling="auto" src="right.html"/></frameset></frameset></html>
          

          According to the XHTML 1.0 "frameset" DTD and the HTML 4.01 "frameset" DTD, the <frameset> element should NOT be inside of a body tag, which is why you're seeing the odd output.

          I believe the issue here is that based on TagSoup's state machine architecture, the <body> tag has been emitted by the time you get to the <frameset>. TagSoup could hang onto the <body> tag until it sees something other than a <frameset>, but that feels pretty extreme.
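
          Rather than changing TagSoup, a consumer of its events could cope with the <body/> followed by <frameset> sequence itself. The sketch below is one possible approach, assuming the consumer counts depth from either body or frameset; FramesetAwareDepth and keptFor are hypothetical names, and nothing here is confirmed to be what the attached patch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical depth tracker that treats <frameset> like <body> as the
// start of content, so an early </body> no longer hides the frames.
class FramesetAwareDepth {
    private int contentLevel = 0;
    private final List<String> kept = new ArrayList<>();

    void start(String name) {
        // Enter content on body or frameset; go deeper on anything nested.
        if ("body".equals(name) || "frameset".equals(name) || contentLevel > 0) {
            contentLevel++;
        }
        if (contentLevel > 0 && !"body".equals(name)) {
            kept.add(name);
        }
    }

    void end(String name) {
        if (contentLevel > 0) {
            contentLevel--;
        }
    }

    // Events are element names; a leading '/' marks an end tag.
    static List<String> keptFor(String... events) {
        FramesetAwareDepth d = new FramesetAwareDepth();
        for (String e : events) {
            if (e.startsWith("/")) {
                d.end(e.substring(1));
            } else {
                d.start(e);
            }
        }
        return d.kept;
    }
}
```

          Feeding it the order TagSoup produces, for example keptFor("html", "body", "/body", "frameset", "frame", "/frame", "/frameset", "/html"), keeps frameset and frame even though body has already closed.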

          Side note - the HTML is slightly broken, in that <frame src=\"invalid.html\"/> is followed by </frame>, but it's already been terminated by the "/>" sequence. Don't know if that was intentional or not.

          Also, strictly speaking, a <frame> element cannot have an end tag (HTML 4.01 declares it EMPTY), yet the example writes each one as an open/close pair. They should be <frame src="blah"> without a </frame>.

          Ken Krugler added a comment -

          This also improves handling of <frame> elements for TIKA-463, by resolving relative URLs in src=xxx attributes for these elements.
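
          As an illustration of what resolving a relative src value involves, here is a minimal sketch using java.net.URI from the JDK. The class FrameSrcResolver, the method resolve, and the example.com base URL are assumptions for illustration; how Tika obtains the base URL and performs the resolution is not shown in this issue.

```java
import java.net.URI;

// Hypothetical helper: resolves a frame's src attribute against the
// URL of the document that contains it.
class FrameSrcResolver {
    static String resolve(String baseUrl, String src) {
        // URI.create throws IllegalArgumentException on malformed input
        // (for example, an unencoded space), so callers may want to guard it.
        return URI.create(baseUrl).resolve(src).toString();
    }
}
```

          For a base of http://example.com/docs/index.html, resolving "top.html" gives http://example.com/docs/top.html, and resolving "/left.html" gives http://example.com/left.html. An src that is already absolute is returned unchanged.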

          Ken Krugler added a comment -

          SVN 985288

          Ken Krugler added a comment -

          Just applied a patch (SVN 986089) for a problem that showed up during testing on a larger dataset. An empty value in Metadata was being emitted as a <meta> tag with an empty content=xxx attribute, which can cause SAX processing code to throw an NPE.
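
          A guard against this kind of problem might look like the sketch below, which skips null or empty values before building the attributes for a <meta> element. It assumes the metadata is available as a plain Map; MetaTagSketch and metaAttributes are hypothetical names, and the actual change in SVN 986089 may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: builds the attribute set for each <meta> element,
// leaving out entries whose value is null or empty.
class MetaTagSketch {
    static List<AttributesImpl> metaAttributes(Map<String, String> metadata) {
        List<AttributesImpl> result = new ArrayList<>();
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            String value = e.getValue();
            if (value == null || value.isEmpty()) {
                continue; // never emit <meta> with an empty content attribute
            }
            AttributesImpl atts = new AttributesImpl();
            atts.addAttribute("", "name", "name", "CDATA", e.getKey());
            atts.addAttribute("", "content", "content", "CDATA", value);
            result.add(atts);
        }
        return result;
    }
}
```

          Each returned AttributesImpl can then be passed to a ContentHandler's startElement call for a meta element, so downstream SAX code never sees an empty content value.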


            People

            • Assignee: Ken Krugler
            • Reporter: Julien Nioche
            • Votes: 0
            • Watchers: 0
