Tika / TIKA-225

[PATCH] Various bugfixes for MIME detection

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.4
    • Fix Version/s: 0.4
    • Component/s: mime
    • Labels: None

    Description

      Here's a patch that solves the following issues:

      • text/plain had too high a priority. XML files can start with the same byte order marks (BOMs), so text/plain must not match before the XML types have been checked.
      • *.xsl, *.xslt and *.xsd are no longer mapped to text/plain; they are XML files, and XSLT has its own MIME type.
      • Consolidated the two XHTML entries.
      • Fixed a bug in the existing XML magics that caused plain XML files to be detected as text/plain.
      • Added magics for the UTF-16 encodings. (Some magics are still missing; see http://www.w3.org/TR/xml/#sec-guessing)
      • Added an entry for XSLT.
      • XML namespace detection didn't work when namespace prefixes were used (examples: XSLT stylesheets or SVG graphics). Corrected this by adding a detection step that runs an XML parser to determine the root element. This could probably be done without an XML parser, but I had limited time available.
      • Added a test case for some files (the test files are in a separate ZIP, to be placed under tika-core\src\test\resources\org\apache\tika\mime).
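
To illustrate the root-element detection step described above, here is a minimal sketch in plain JAXP/SAX. It is not the code from the attached patch: the class name RootElementSniffer and the method detectRoot are hypothetical names chosen for this example. The idea is to run a namespace-aware parser, abort as soon as the first element is reported, and return its namespace URI and local name, which are independent of whatever prefix the document uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika or of the attached patch.
public class RootElementSniffer {

    /** Thrown from the handler to stop parsing once the root element is known. */
    private static final class RootFound extends SAXException {
        private static final long serialVersionUID = 1L;
        final QName name;

        RootFound(QName name) {
            super("root element found");
            this.name = name;
        }
    }

    /**
     * Returns the namespace URI and local name of the root element,
     * or null if the data cannot be parsed as XML up to that point.
     */
    public static QName detectRoot(byte[] data) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Namespace awareness resolves prefixes such as "xsl:" to their URI.
            factory.setNamespaceAware(true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(data), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                        String qName, Attributes atts) throws SAXException {
                    // The first element reported is the root; abort right away.
                    throw new RootFound(new QName(uri, localName));
                }

                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Replace external DTDs/entities with empty content
                    // so that detection never touches the network or disk.
                    return new InputSource(new StringReader(""));
                }
            });
        } catch (RootFound found) {
            return found.name;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return null; // not well-formed up to the root element, or no parser
        }
        return null; // parse finished without reporting any element
    }
}
```

A caller could then map the returned name to a type, for example treating a root of {http://www.w3.org/1999/XSL/Transform}stylesheet as XSLT. That mapping is left out here because it depends on how the MIME type registry is organised.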

      HTH

      Attachments

        1. detection-bugfixes.diff
          15 kB
          Jeremias Maerki
        2. test-files.zip
          3 kB
          Jeremias Maerki

          People

            Assignee: Jukka Zitting (jukkaz)
            Reporter: Jeremias Maerki (jeremias@apache.org)