TIKA-3034: Detector always returns text/plain when scanning Mathematica files

    Details

    • Type: Bug
    • Status: Open
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 1.23
    • Fix Version/s: 1.23
    • Component/s: detector
    • Labels:

      Description

      We are using Tika to implement our MIME type detection module. The library does not appear to detect Mathematica files, although the documentation says it does [1]. The Tika detector always returns `text/plain` instead of `application/mathematica`, which is the type given in the documentation and expected by the unit tests [2].

      Doing the same check with the Python code below returns the right MIME type for any Mathematica file downloaded from the Wolfram Library Archive [3]:

      #!/usr/bin/python3
      # Print the MIME type guessed for the file given on the command line.
      import mimetypes
      import sys

      test_file = sys.argv[1]
      print(mimetypes.MimeTypes().guess_type(test_file)[0])

      We therefore suspect a bug in the way the Tika detector guesses MIME types for Mathematica files.
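
      One thing worth checking before drawing that conclusion: Python's mimetypes module looks only at the file name and never reads the file content, so the script above exercises extension-based detection only. A minimal sketch of this, assuming a hypothetical explicit `.nb` mapping (registered here so the result does not depend on the local mime.types database) and made-up file names:

```python
import mimetypes

# Assumption: the ".nb" mapping is registered explicitly so the outcome does
# not depend on whatever the system's mime.types database happens to contain.
db = mimetypes.MimeTypes()
db.add_type("application/mathematica", ".nb")


def guess_by_name(filename):
    """Return the MIME type that mimetypes derives from the file name alone."""
    return db.guess_type(filename)[0]


# Neither file exists on disk: only the extension is inspected.
nb_type = guess_by_name("does_not_exist.nb")
no_ext_type = guess_by_name("notebook_without_extension")

print(nb_type)
print(no_ext_type)
```

      If Tika is handed only a stream with no file name, it has to decide from the content alone, which may be one reason the two results differ. This is an assumption and has not been verified against Tika 1.23.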

      There is also an existing ticket requesting a Mathematica file detector: https://issues.apache.org/jira/browse/TIKA-1520

      References:

       [1] https://tika.apache.org/1.23/formats.html

       [2] https://github.com/apache/tika/blob/master/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java#L64

       [3] https://library.wolfram.com/infocenter/Courseware/4706/

       

            People

            • Assignee:
              Unassigned
              Reporter:
              nvntung@gmail.com Tung Nguyen
            • Votes:
              0
              Watchers:
              5
