  1. Tika
  2. TIKA-2064

Document type detected incorrectly for Stata datasets (.dta extension)

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: None
    • Component/s: detector
    • Labels: None

    Description

      Tika incorrectly detects the content type of Stata datasets (created with Stata, http://www.stata.com) as `text/html`. I tested this with the latest release of Tika, v1.13:

      ```
      $ curl -O http://www.stata-press.com/data/r14/auto.dta
      $ java -jar tika-app-1.13.jar --detect auto.dta
      text/html
      ```

      I believe that the type should instead be `application/octet-stream` (or the equivalent).
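
      To help narrow down a misdetection like this, it can be useful to look at the first bytes of the file. The sketch below is only a diagnostic aid and is not part of Tika: `read_header` and `looks_like_markup_dta` are hypothetical helper names, and the sketch assumes that newer .dta formats (release 117 and later) open with the ASCII tag `<stata_dta>`. Whether that tag is what leads Tika to report `text/html` is not established in this report.

```python
# Hypothetical diagnostic helpers (not part of Tika).
# Assumption: newer Stata .dta formats (release 117+) begin with this ASCII tag.
XML_STYLE_MAGIC = b"<stata_dta>"


def read_header(path: str, size: int = 64) -> bytes:
    """Return up to `size` bytes from the start of the file at `path`."""
    with open(path, "rb") as f:
        return f.read(size)


def looks_like_markup_dta(header: bytes) -> bool:
    """Return True if the header starts with the assumed '<stata_dta>' tag."""
    return header.startswith(XML_STYLE_MAGIC)
```

      For example, `looks_like_markup_dta(read_header("auto.dta"))` reports whether the downloaded file starts with a markup-like tag.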

      I originally reported this bug downstream (at https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to report upstream to Tika. In addition to the one I downloaded using `curl` in my example, a variety of reference Stata datasets are posted here: http://www.stata-press.com/data/r14/dmain.html

      Attachments

        1. stata_test_data.dta
          1 kB
          Michael Stepner

          People

            Assignee: Unassigned
            Reporter: Michael Stepner (michaelstepner)