Tika / TIKA-478

HtmlParser can emit <head> elements inside of <body> block

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.7
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels:
      None

      Description

      The change made to fix TIKA-379 causes the XHTMLContentHandler used by HtmlHandler to generate the HTML <head> element, and its nested <title>, prematurely.

      This creates problems for any downstream content handler, which can then receive an empty <title> element as well as <meta> elements inside the <body> element (which is invalid).
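
      To make the symptom concrete, here is a minimal sketch of a check that a downstream consumer could run on serialized output to spot head-only elements inside <body>. It uses only the JDK's SAX parser; the class name HeadInBodyCheck and the method misplaced() are hypothetical and are not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Tika.
class HeadInBodyCheck {

    /** Returns the names of head-only elements found inside <body>. */
    static List<String> misplaced(String xhtml) {
        final List<String> found = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private int bodyDepth = 0;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("body")) {
                    bodyDepth++;
                } else if (bodyDepth > 0
                        && (qName.equals("head") || qName.equals("title") || qName.equals("meta"))) {
                    found.add(qName); // head-only element seen after <body> opened
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("body")) {
                    bodyDepth--;
                }
            }
        };
        try {
            // The default factory is not namespace-aware, so qName is the plain tag name.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return found;
    }
}
```

      For output showing the problem described above, misplaced() would report the stray <meta>; for well-formed output it would return an empty list.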

      1. TIKA-478-2.patch
        1 kB
        Ken Krugler
      2. TIKA-478.patch
        12 kB
        Ken Krugler

        Issue Links

          Activity

          Ken Krugler created issue -
          Jukka Zitting added a comment -

          Oh, I see now where this problem with <meta/> elements is coming from.

          One reasonably clean way to solve this would be to disable the output of <meta/> elements from HtmlHandler while keeping the code that sets the respective Metadata entries. Then in XHTMLContentHandler we'd modify the lazyStartDocument() method to output not just the <title/> element but the full set of collected metadata as <meta/> elements. We could also set the lang attribute (or xml:lang?) of the <html/> element if the respective Metadata entry is set.

          The nice thing about this solution would be that the inclusion of metadata in <head/> would work also for other document types beyond HTML.
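
          The proposal above could be sketched roughly as follows. This is a simplified, string-based stand-in rather than Tika's actual XHTMLContentHandler: the class LazyHeadSketch, the plain Map used in place of Tika's Metadata, and the "title" and "Content-Language" keys are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the idea; not Tika's XHTMLContentHandler.
class LazyHeadSketch {
    private final Map<String, String> metadata; // stands in for Tika's Metadata
    private final StringBuilder out = new StringBuilder();
    private boolean documentStarted = false;

    LazyHeadSketch(Map<String, String> metadata) {
        this.metadata = metadata;
    }

    /** Emits <html><head>...</head><body> exactly once, on first use. */
    private void lazyStartDocument() {
        if (documentStarted) {
            return;
        }
        documentStarted = true;
        String lang = metadata.get("Content-Language"); // assumed key
        out.append(lang == null ? "<html>" : "<html lang=\"" + escape(lang) + "\">");
        out.append("<head>");
        for (Map.Entry<String, String> e : metadata.entrySet()) {
            if (e.getKey().equals("title")) {
                continue; // the title gets its own element below
            }
            out.append("<meta name=\"").append(escape(e.getKey()))
               .append("\" content=\"").append(escape(e.getValue())).append("\"/>");
        }
        out.append("<title>").append(escape(metadata.getOrDefault("title", ""))).append("</title>");
        out.append("</head><body>");
    }

    /** Body text; triggers the head if it has not been written yet. */
    void characters(String text) {
        lazyStartDocument();
        out.append(escape(text));
    }

    /** Closes the document and returns the serialized result. */
    String endDocument() {
        lazyStartDocument();
        out.append("</body></html>");
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

          Because the head is built from the metadata map rather than from parser events, the same code path would serve any parser that fills in metadata, which is the benefit noted above for non-HTML document types.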

          Ken Krugler made changes -
          Field Original Value New Value
          Link This issue relates to TIKA-379 [ TIKA-379 ]
          Ken Krugler added a comment -

          Emitting metadata entries inside of the <head> element works for <meta> elements, of course, but not for <link> or <base>. Though it does improve output for non-HTML parsers, so it feels like the right way to at least handle <meta>.

          Ken Krugler added a comment -

          I ran into a test case failing when I applied my fix - looks like the epub parser calls XHTMLContentHandler with <head> elements, among others. This triggered the premature emitting of <title>. In order to guard against similar problems with other parsers, I modified XHTMLContentHandler to try to ignore startElement() calls with elements that would be auto-generated.
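
          The guard described here might look something like the sketch below. It is not the code from the patch: the class AutoElementFilterSketch is hypothetical, and the set of auto-generated element names (html, head, title, body) is an assumption. The sketch only filters element events and leaves character data out of scope.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of the guard; not the code from TIKA-478.patch.
class AutoElementFilterSketch {
    // Elements the handler is assumed to generate on its own.
    private static final Set<String> AUTO_GENERATED =
            new HashSet<>(Arrays.asList("html", "head", "title", "body"));

    private final List<String> events = new ArrayList<>();

    void startElement(String name) {
        if (AUTO_GENERATED.contains(name)) {
            return; // ignore: the handler emits this element itself
        }
        events.add("<" + name + ">");
    }

    void endElement(String name) {
        if (AUTO_GENERATED.contains(name)) {
            return; // keep start and end events balanced
        }
        events.add("</" + name + ">");
    }

    /** Events that were actually passed through. */
    List<String> events() {
        return events;
    }
}
```

          With a guard like this, a parser such as the epub parser could call startElement("head") without triggering a second, premature head section; only the body-level elements would pass through.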

          Ken Krugler made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Ken Krugler made changes -
          Attachment TIKA-478.patch [ 12451953 ]
          Ken Krugler added a comment -

          SVN 984997.

          Normally I'd let a patch like this bake for a while, but it's blocking some work that needs to use Tika trunk for other fixes, so I've pushed it out sooner than usual. Happy to revert if a review uncovers any issues.

          Ken Krugler made changes -
          Status In Progress [ 3 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Ken Krugler added a comment -

          Additional fix for problem with missing closing </body> and </html> tags.

          Ken Krugler made changes -
          Attachment TIKA-478-2.patch [ 12451967 ]
          Transition               Time In Source Status   Execution Times   Last Executer   Last Execution Date
          Open -> In Progress      6h 2m                   1                 Ken Krugler     12/Aug/10 22:35
          In Progress -> Resolved  10m 22s                 1                 Ken Krugler     12/Aug/10 22:45

            People

            • Assignee: Ken Krugler
            • Reporter: Ken Krugler
            • Votes: 0
            • Watchers: 0
