Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.13
None
None
Description
Tika incorrectly detects the content type of Stata datasets (`.dta` files created with Stata, http://www.stata.com) as `text/html`. I tested this with the latest Tika release, v1.13:
```
$ curl -O http://www.stata-press.com/data/r14/auto.dta
$ java -jar tika-app-1.13.jar --detect auto.dta
text/html
```
I believe that the type should instead be `application/octet-stream` (or the equivalent).
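One possible explanation, stated as an assumption and not verified against Tika's detection rules: newer Stata dataset formats (117 and later) begin with an ASCII `<stata_dta>` tag, which a detector might mistake for markup. The sketch below checks the leading bytes of a file for that tag. The helper name `looks_like_new_dta` is hypothetical and is not part of Tika or Stata.

```shell
# Hypothetical helper (not part of Tika or Stata): succeeds when the file
# begins with the 11-byte ASCII tag "<stata_dta>".
# Assumption: Stata dataset formats 117 and later start with this tag.
looks_like_new_dta() {
  # Read only the first 11 bytes; strip NUL bytes so binary headers from
  # older formats do not trigger shell warnings in the substitution.
  [ "$(head -c 11 "$1" | tr -d '\0')" = "<stata_dta>" ]
}

# Usage with the file downloaded in the reproduction steps:
#   looks_like_new_dta auto.dta && echo "starts with <stata_dta>"
```

If the check succeeds for `auto.dta`, that would be consistent with the file's tag-like header contributing to the `text/html` result, though it would not confirm it.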
I originally reported this bug downstream (at https://github.com/laurilehmijoki/s3_website/issues/232) and was advised to report it upstream to Tika. In addition to the file I downloaded with `curl` in my example, a variety of reference Stata datasets are posted here: http://www.stata-press.com/data/r14/dmain.html