Tika / TIKA-3034

Detector always returns text/plain when scanning Mathematica files


Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 1.23
    • Fix Version/s: 1.23
    • Component/s: detector

    Description

      We are using Tika to implement our MIME type detection module. Tika does not appear to detect Mathematica files, although the documentation lists the format as supported [1]. The detector always returns `text/plain` instead of `application/mathematica`, which is the type given in the documentation and expected by the unit tests [2].

      Running the same check with the Python script below returns the correct MIME type for any Mathematica file downloaded from the Wolfram Library Archive [3]:

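      #!/usr/bin/python3
      import mimetypes
      import sys

      # Path of the file to check, passed as the first argument.
      test_file = sys.argv[1]
      # guess_type() returns (type, encoding); print only the type.
      print(mimetypes.MimeTypes().guess_type(test_file)[0])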
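
      One caveat when comparing the two results: Python's `mimetypes` module derives its answer from the file name extension and does not read the file's content. The sketch below illustrates this with a path that does not exist on disk. The helper name `guess_by_name` and the explicit `.nb` registration are assumptions made for illustration so that the result does not depend on which mappings happen to be installed; they are not part of the original script.

```python
import mimetypes


def guess_by_name(path):
    """Return the MIME type that mimetypes derives from the file name alone."""
    types = mimetypes.MimeTypes()
    # Assumption: register the .nb mapping explicitly so this sketch behaves
    # the same regardless of the mappings available on the machine.
    types.add_type("application/mathematica", ".nb")
    # guess_type() returns a (type, encoding) tuple; keep only the type.
    return types.guess_type(path)[0]


# The file does not need to exist: only the extension is inspected.
print(guess_by_name("does_not_exist.nb"))
```

      A matching result from Python therefore shows that the extension is mapped to `application/mathematica`, not that the file content was recognized.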
      

      We therefore suspect a bug in how the Tika detector guesses the MIME type of Mathematica files.

      There is also an existing ticket requesting a Mathematica file detector: https://issues.apache.org/jira/browse/TIKA-1520

      References:

       [1] https://tika.apache.org/1.23/formats.html

       [2] https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64

       [3] https://library.wolfram.com/infocenter/Courseware/4706/

       

          People

            Assignee: Unassigned
            Reporter: Tung Nguyen (nvntung@gmail.com)