Tika / TIKA-3833

bzip2 MIME type is detected as bzip instead when using tika-core

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.1
    • Fix Version/s: 2.5.0
    • Component/s: core
    • Labels: None

    Description

      Hello, I'm having a bit of a problem when using the tika-core module (v2.4.1).
      I am trying to detect the MIME type of a bzip2 file and, instead of
      application/x-bzip2, I am getting application/x-bzip. I believe it has
      something to do with the mime-type definitions in the
      tika-mimetypes.xml file.

      <mime-type type="application/x-bzip">
        <magic priority="40">
          <match value="BZh" type="string" offset="0"/>
        </magic>
        <glob pattern="*.bz"/>
        <glob pattern="*.tbz"/>
      </mime-type>
      <mime-type type="application/x-bzip2">
        <sub-class-of type="application/x-bzip"/>
        <_comment>Bzip 2 UNIX Compressed File</_comment>
        <magic priority="40">
          <match value="\x42\x5a\x68\x39\x31" type="string" offset="0"/>
        </magic>
        <glob pattern="*.bz2"/>
        <glob pattern="*.tbz2"/>
        <glob pattern="*.boz"/>
      </mime-type>

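      Both magic entries have priority 40. I believe the priority of
      application/x-bzip2 should be higher, because the string value "BZh" and
      the first three bytes of the hex value, "\x42\x5a\x68", are equal
      (\x42\x5a\x68 = BZh), so any file that matches the x-bzip2 magic also
      matches the x-bzip magic.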
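
      To make the overlap concrete, here is a minimal standalone sketch that
      does not use Tika at all. The object name `MagicOverlap`, the helper
      `matchesAtZero`, and the sample header bytes are illustrative
      assumptions; only the two magic values come from the XML quoted above.

```scala
import java.nio.charset.StandardCharsets

object MagicOverlap {
  // Magic values copied from the two <match> elements quoted above.
  val bzipMagic: Array[Byte] = "BZh".getBytes(StandardCharsets.US_ASCII)
  val bzip2Magic: Array[Byte] = Array(0x42, 0x5a, 0x68, 0x39, 0x31).map(_.toByte)

  // Hypothetical helper: true when `data` starts with `magic` at offset 0.
  def matchesAtZero(data: Array[Byte], magic: Array[Byte]): Boolean =
    data.length >= magic.length &&
      java.util.Arrays.equals(data.take(magic.length), magic)

  def main(args: Array[String]): Unit = {
    // Illustrative header of a bzip2 stream compressed at level 9.
    val header = "BZh91AY&SY".getBytes(StandardCharsets.US_ASCII)

    println(matchesAtZero(bzip2Magic, bzipMagic)) // true: the long magic starts with the short one
    println(matchesAtZero(header, bzipMagic))     // true
    println(matchesAtZero(header, bzip2Magic))    // true
  }
}
```

      Both patterns match the same header, so with equal priorities nothing in
      the magic rules themselves prefers the more specific type.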

      Maybe I am missing something here? Does this look like a bug, or does
      it work as intended? Is there some sort of hint I can provide to the
      default detector?

      A small example in Scala:

      import org.apache.tika.config.TikaConfig
      import org.apache.tika.detect.DefaultProbDetector
      import org.apache.tika.metadata.Metadata

      import java.io.{BufferedInputStream, File, FileInputStream}

      object AAA {
        def main(args: Array[String]): Unit = {
          val config = TikaConfig.getDefaultConfig

          val file = new File("/home/ekazakas/test.csv.bz2")
          val detector = new DefaultProbDetector()
          // Buffered so the detector can mark/reset while reading the magic bytes
          val stream = new BufferedInputStream(new FileInputStream(file))
          val mediaType =
            try detector.detect(stream, new Metadata)
            finally stream.close()
          val mimeType = config.getMimeRepository.forName(mediaType.toString)
          println(mimeType)
        }
      }

      This prints `application/x-bzip` instead of `application/x-bzip2`.
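
      On the question of giving the detector a hint: one thing that may be
      worth trying is passing the file name in the metadata, so that the
      `*.bz2` glob can take part in detection. The sketch below is untested
      against this issue. The object name `HintedDetection` and the helper
      `metadataFor` are hypothetical, and it uses the detector from the default
      TikaConfig rather than DefaultProbDetector. Whether the name hint
      actually changes the result depends on the detector and the Tika version.

```scala
import org.apache.tika.config.TikaConfig
import org.apache.tika.metadata.{Metadata, TikaCoreProperties}

import java.io.{BufferedInputStream, File, FileInputStream}

object HintedDetection {
  // Hypothetical helper: metadata carrying the file name as a detection hint.
  def metadataFor(fileName: String): Metadata = {
    val metadata = new Metadata
    metadata.set(TikaCoreProperties.RESOURCE_NAME_KEY, fileName)
    metadata
  }

  def main(args: Array[String]): Unit = {
    val file = new File("/home/ekazakas/test.csv.bz2")
    // Assumption: the default config's detector is used instead of DefaultProbDetector.
    val detector = TikaConfig.getDefaultConfig.getDetector
    val stream = new BufferedInputStream(new FileInputStream(file))
    try println(detector.detect(stream, metadataFor(file.getName)))
    finally stream.close()
  }
}
```

      If the name hint is honored, the glob on application/x-bzip2 gives the
      detector a way to prefer the subtype even though both magic entries match.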

       

      Attachments

        1. tika-bug.zip (9 kB, uploaded by Eduardas Kazakas)

          People

            Assignee: Unassigned
            Reporter: Eduardas Kazakas (kamiKAZIK)
            Votes: 0
            Watchers: 3
