Tika / TIKA-286

HtmlParser calls characters() with post-body data before processing the terminating body element.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.4
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

      Description

      Using this example data:

      <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
             "http://www.w3.org/TR/html4/strict.dtd">
      <html lang="en">
      <head>
      	<meta http-equiv="content-type" content="text/html; charset=utf-8">
      	<title>Untitled</title>
      	<base href="http://newdomain.com">
      </head>
      <body>
      
      <a href="link" target="_blank">link1</a>
      <a href="http://domain.com/link" target="_blank">link2</a>
      
      </body>
      </html>
      

      The handler's characters() method is called with the following text, one line per call:

      Untitled
      \n\n
      link1
      \n
      link2
      \n\n
      \n
      \n

      The first six calls make sense to me.

      The last two calls (each with a single \n) happen just before endElement("body") is called, which is unexpected.

      Judging by the offset into the buffer passed to characters(), these are the line breaks that follow the </body> tag. If I put any number of line breaks between </body> and </html>, they are all passed to characters() before the endElement("body") call.
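
      One way to inspect the event order is a handler that records characters() and endElement() calls in sequence. The sketch below is illustrative only: EventRecorder and record() are hypothetical names, and it drives the handler with the JDK's built-in SAX parser on a small well-formed snippet rather than with Tika's HtmlParser. With a plain XML parser the line break after </body> is reported after endElement("body"), which is the ordering I expected. Passing the same handler to HtmlParser with the example data should show the difference.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records SAX events in the order they arrive.
class EventRecorder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Escape newlines so whitespace-only calls are visible in the log.
        String text = new String(ch, start, length).replace("\n", "\\n");
        events.add("characters(" + text + ")");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement(" + qName + ")");
    }

    // Parses well-formed markup with the JDK SAX parser (not Tika)
    // and returns the recorded events.
    static List<String> record(String xml) {
        try {
            EventRecorder recorder = new EventRecorder();
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(bytes), recorder);
            return recorder.events;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // A line break sits between </body> and </html>, as in the report.
        for (String event : record("<html><body>link1</body>\n</html>")) {
            System.out.println(event);
        }
    }
}
```

      A SAX parser may split text across several characters() calls, so compare the position of whitespace-only calls relative to endElement("body") rather than the exact number of calls.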

            People

            • Assignee: Unassigned
            • Reporter: kkrugler Kenneth William Krugler