  1. Tika
  2. TIKA-2064

Document type detected incorrectly for Stata datasets (.dta extension)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: None
    • Component/s: detector
    • Labels:
      None

      Description

      Tika incorrectly detects the content type of Stata datasets (`.dta` files created with Stata, http://www.stata.com) as `text/html`. I tested this with the latest Tika release, v1.13:

      ```
      $ curl -O http://www.stata-press.com/data/r14/auto.dta
      $ java -jar tika-app-1.13.jar --detect auto.dta
      text/html
      ```

      I believe that the type should instead be `application/octet-stream` (or the equivalent).

      I originally reported this bug downstream (at https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to report upstream to Tika. In addition to the one I downloaded using `curl` in my example, a variety of reference Stata datasets are posted here: http://www.stata-press.com/data/r14/dmain.html
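
      One possible explanation, offered as an assumption rather than a confirmed diagnosis: `.dta` files in format 117 and later (Stata 13+) begin with the ASCII tag `<stata_dta>`, which a markup-oriented detector might mistake for HTML. The sketch below defines a hypothetical helper, `is_new_stata_dta`, that checks for that leading tag. It is not part of Tika and does not cover older `.dta` formats, which start with a binary release byte.

      ```shell
      # Hypothetical helper (not part of Tika): succeeds if the file starts
      # with the 11-byte tag "<stata_dta>" used by .dta format 117 and later.
      is_new_stata_dta() {
        # tr strips NUL bytes so binary input does not trigger shell warnings;
        # any NUL in the first 11 bytes shortens the string, so it cannot match.
        [ "$(head -c 11 "$1" | tr -d '\0')" = "<stata_dta>" ]
      }

      # Example usage (assumes auto.dta has been downloaded as shown earlier):
      #   is_new_stata_dta auto.dta && echo "starts with <stata_dta>"
      ```

      A check like this could serve as a downstream workaround for overriding the detected type until detection is fixed upstream.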

            People

            • Assignee:
              Unassigned
              Reporter:
              michaelstepner Michael Stepner
            • Votes:
              0
              Watchers:
              3