Tika / TIKA-1365

Incorrect MIME type detection for Apache Lucene web site


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 1.8
    • Component/s: detector
    • Labels: None

    Description

      Tika 1.5 detects many pages from the Apache Lucene web site as XML, for example this page:
      http://lucene.apache.org/core/discussion.html

      Here is the error log. Parsing fails because Tika uses the XML parser:

      Apache Tika was unable to parse the document
      at http://lucene.apache.org/core/discussion.html.

      The full exception stack trace is included below:

      org.apache.tika.exception.TikaException: XML parse error
      at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
      at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:320)
      at org.apache.tika.gui.TikaGUI.openURL(TikaGUI.java:293)
      at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:247)
      at javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:2018)
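
      The trace above is consistent with what typically happens when ordinary HTML is handed to a strict XML parser: markup such as an unclosed `<meta>` or `<br>` is not well-formed XML, so parsing aborts. The following is a minimal sketch of that behaviour using only the JDK's SAX parser, not Tika itself. The class name `XmlWellFormedCheck`, the method `isWellFormedXml`, and the sample markup are hypothetical and are not taken from discussion.html.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Tika.
class XmlWellFormedCheck {

    // Returns true only if the text parses as well-formed XML.
    static boolean isWellFormedXml(String text) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            // DefaultHandler rethrows fatal errors, so malformed input
            // lands in the catch block below.
            parser.parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Not well-formed XML (for example, mismatched or unclosed tags).
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problem; treated as "not parseable" in this sketch.
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample markup: typical HTML with unclosed <meta> and <br> elements.
        String html = "<html><head><meta charset=\"utf-8\"></head><body><p>Hello<br></body></html>";
        String xml = "<root><item/></root>";
        System.out.println("html sample well-formed: " + isWellFormedXml(html));
        System.out.println("xml sample well-formed: " + isWellFormedXml(xml));
    }
}
```

      A check like this only shows why a page misdetected as XML fails in the XML parser. It says nothing about why the detector chose XML in the first place.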

      Attachments

        1. discussion.html
          15 kB
          Tien Nguyen Manh

          People

            Assignee: Chris A. Mattmann (chrismattmann)
            Reporter: Tien Nguyen Manh (tiennm)
            Votes: 1
            Watchers: 9
