Tika / TIKA-1808

Head section closed too eagerly

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 1.11
    • Fix Version/s: None
    • Component/s: parser
    • Labels: None

    Description

      XHTMLContentHandler has some logic that closes the head section too early, or this is a problem in TagSoup. In this case [1], a <div> element appears in the head, causing the head to be closed. Elements that follow inside <head> are then not reported to custom ContentHandlers, so I cannot read the document's title or any other meta tags.

      It can be worked around by registering a custom HTMLSchema in the ParseContext, e.g. with schema.elementType("div", HTMLSchema.M_EMPTY, 65535, 0), but this isn't really an elegant solution.

      [1] http://www.aljazeera.com/news/2015/05/150516182251747.html
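
To make the workaround above easier to follow, here is a toy model of why redefining "div" could keep the head open. It assumes that a schema-driven parser such as TagSoup accepts a child when the parent's content model and the child's membership mask share a bit. The class HeadSchemaSketch, its bit values, and the method reportedInHead are hypothetical names invented for this sketch; it uses neither Tika nor TagSoup, and the real constants and the real handling after the head closes may differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of schema-driven containment. Not TagSoup's or Tika's actual code. */
final class HeadSchemaSketch {
    // Hypothetical group bits, chosen for this sketch only.
    static final int M_EMPTY = 0;      // "may contain nothing"
    static final int M_HTML = 1 << 0;  // things allowed directly under <html>
    static final int M_HEAD = 1 << 1;  // things allowed inside <head>
    static final int M_BLOCK = 1 << 2; // block-level content

    /** An element type: what it may contain (model) and which groups it belongs to (memberOf). */
    static final class ElementType {
        final int model;
        final int memberOf;

        ElementType(int model, int memberOf) {
            this.model = model;
            this.memberOf = memberOf;
        }

        /** Assumed rule: a parent accepts a child when the two masks share a bit. */
        boolean canContain(ElementType child) {
            return (model & child.memberOf) != 0;
        }
    }

    /** A tiny default schema in which div is block content only. */
    static Map<String, ElementType> defaultSchema() {
        Map<String, ElementType> schema = new HashMap<>();
        schema.put("head", new ElementType(M_HEAD, M_HTML));
        schema.put("title", new ElementType(M_EMPTY, M_HEAD));
        schema.put("meta", new ElementType(M_EMPTY, M_HEAD));
        schema.put("div", new ElementType(M_BLOCK, M_BLOCK));
        return schema;
    }

    /** Returns the tags reported inside head; the first rejected tag closes the head for good. */
    static List<String> reportedInHead(Map<String, ElementType> schema, String... tags) {
        ElementType head = schema.get("head");
        List<String> reported = new ArrayList<>();
        for (String tag : tags) {
            if (!head.canContain(schema.get(tag))) {
                break; // head is closed here; later tags land outside it
            }
            reported.add(tag);
        }
        return reported;
    }

    public static void main(String[] args) {
        Map<String, ElementType> schema = defaultSchema();
        System.out.println(reportedInHead(schema, "meta", "div", "title")); // prints [meta]

        // Mirrors the workaround's arguments: empty model, member of every low-order group.
        schema.put("div", new ElementType(M_EMPTY, 65535));
        System.out.println(reportedInHead(schema, "meta", "div", "title")); // prints [meta, div, title]
    }
}
```

In this model the membership mask 65535 sets the 16 low-order bits, so the redefined "div" belongs to every group, including the one the head accepts. That is also why the workaround is blunt: it changes how "div" is treated everywhere in the document, not only inside the head.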

            People

              Assignee: Unassigned
              Reporter: Markus Jelsma (markus17)
              Votes: 0
              Watchers: 3