TIKA-2064

Document type detected incorrectly for Stata datasets (.dta extension)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13
    • Fix Version/s: None
    • Component/s: detector
    • Labels: None

      Description

      Tika incorrectly detects the content type of Stata datasets (created with Stata, http://www.stata.com) as `text/html`. I tested this with the latest release of Tika, v1.13:

      ```
      $ curl -O http://www.stata-press.com/data/r14/auto.dta
      $ java -jar tika-app-1.13.jar --detect auto.dta
      text/html
      ```

      I believe that the type should instead be `application/octet-stream` (or the equivalent).

      I originally reported this bug downstream (at https://github.com/laurilehmijoki/s3_website/issues/232), and was advised to report upstream to Tika. In addition to the one I downloaded using `curl` in my example, a variety of reference Stata datasets are posted here: http://www.stata-press.com/data/r14/dmain.html

      1. stata_test_data.dta
        1 kB
        Michael Stepner

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #141 (See https://builds.apache.org/job/tika-2.x/141/)
        TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (nick: rev 443a21e3fb564df9bb1c52f6533bd5da6f5cfcc8)

        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit (nick: rev e58ade381a3e4285eb81d55fb250611e82adbef7)
        • (add) tika-parsers/src/test/resources/test-documents/testStataDTA.txt
        • (add) tika-parsers/src/test/resources/test-documents/testStataDTA.dta
        • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          Merge changes for TIKA-2064 to 2.x (nick: rev 9f6241161af93c9cefd4ba90342b6834a49dc4b1)
        • (add) tika-test-resources/src/test/resources/test-documents/testStataDTA.dta
        • (delete) tika-parsers/src/test/resources/test-documents/testStataDTA.txt
        • (add) tika-test-resources/src/test/resources/test-documents/testStataDTA.txt
        • (delete) tika-parsers/src/test/resources/test-documents/testStataDTA.dta
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1099 (See https://builds.apache.org/job/Tika-trunk/1099/)
        TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection unit (nick: rev 2222fe0ce2e1db633bdcf49bd7b24941374f2033)

        • (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        • (add) tika-parsers/src/test/resources/test-documents/testStataDTA.dta
        • (add) tika-parsers/src/test/resources/test-documents/testStataDTA.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #45 (See https://builds.apache.org/job/tika-2.x-windows/45/)
        TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (nick: rev 443a21e3fb564df9bb1c52f6533bd5da6f5cfcc8)

        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          TIKA-2064 Test Stata DTA files from Michael Stepner, plus detection (nick: rev e58ade381a3e4285eb81d55fb250611e82adbef7)
        • (add) tika-parsers/src/test/resources/test-documents/testStataDTA.txt
        • (add) tika-parsers/src/test/resources/test-documents/testStataDTA.dta
        • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          Merge changes for TIKA-2064 to 2.x (nick: rev 9f6241161af93c9cefd4ba90342b6834a49dc4b1)
        • (add) tika-test-resources/src/test/resources/test-documents/testStataDTA.dta
        • (delete) tika-parsers/src/test/resources/test-documents/testStataDTA.txt
        • (delete) tika-parsers/src/test/resources/test-documents/testStataDTA.dta
        • (add) tika-test-resources/src/test/resources/test-documents/testStataDTA.txt
        michaelstepner Michael Stepner added a comment -

        Hi Nick, I'm happy to dual-license it as Apache License Version 2.0!

        gagravarr Nick Burch added a comment -

        Are you happy to dual-license it as Apache License, Version 2.0? We can certainly use a CC0 file (see http://www.apache.org/legal/resolved.html#can-works-placed-in-the-public-domain-be-included-in-apache-products), but it's typically a little bit more record keeping!

        michaelstepner Michael Stepner added a comment -

        Hi Nick,

        I've created a test dataset and attached it here.

        This file was created in Stata 13.1 running on Mac OS X. There are small differences in the format of the files depending on the Stata version and the operating system it's run on. But I don't imagine it's worthwhile to build more tests unless we see a bug with a Stata file created using a different version or OS.

        The code that created the test dataset is:

        ```
        clear all
        set obs 3

        gen byte integers=_n
        gen double reals = sqrt(_n)

        gen fruits = ""
        replace fruits = "apple" in 1
        replace fruits = "banana" in 2
        replace fruits = "cantaloupe" in 3

        save stata_test_data.dta
        ```

        I'd like to release this code and related dataset to the public domain using [CC0](https://creativecommons.org/publicdomain/zero/1.0/). I can also apply any more restrictive license you prefer that makes it easy for you to use the file.
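The note above about format differences between Stata versions can be checked directly on a file's leading bytes. The following is a minimal Python sketch, not Tika's code: it assumes the XML-style header of newer .dta files begins with `<stata_dta><header><release>NNN</release>`, as described in the published spec linked in this thread, and the helper name `dta_release` is hypothetical. My understanding is that Stata 13 writes release 117 and Stata 14 writes 118, but treat that as an assumption to verify against the spec.

```python
import re
from typing import Optional

# Assumption: newer .dta files open with an XML-style header of the form
# b"<stata_dta><header><release>117</release>...". Older, purely binary
# formats do not match this pattern and yield None here.
_RELEASE_RE = re.compile(rb"^<stata_dta><header><release>(\d{3})</release>")


def dta_release(head: bytes) -> Optional[int]:
    """Return the format release number from the leading bytes, or None."""
    match = _RELEASE_RE.match(head)
    return int(match.group(1)) if match else None


def dta_release_of_file(path: str) -> Optional[int]:
    """Read just enough of the file to cover the header prefix."""
    with open(path, "rb") as fh:
        return dta_release(fh.read(64))
```

Calling `dta_release_of_file("stata_test_data.dta")` on the attached file would report which format release it uses, which is one way to decide whether extra test files for other versions are needed.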

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1098 (See https://builds.apache.org/job/Tika-trunk/1098/)
        TIKA-2064 Mime types, with magic, for mostly-xml Stata DTA files. (nick: rev 3c0abc8ebebbaa54a716fbcd15eaf8057343842f)

        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        gagravarr Nick Burch added a comment -

        Magic added in 3c0abc8eb. There are no unit tests yet, though; we can add them once you get a chance to create some simple test files.

        gagravarr Nick Burch added a comment -

        If you could, that would be most helpful!

        michaelstepner Michael Stepner added a comment -

        Nick, I've confirmed that the `auto.dta` test file that I referenced is not released under an open license (thanks to Sergiy's response to my question, linked above). I'm happy to create a small .dta file for unit testing.

        michaelstepner Michael Stepner added a comment -

        Thanks Nick, glad you figured out that there's a non-generic mime type used for these files.

        I've asked the Stata community about the license of the auto.dta test file, which is the most commonly used test file for Stata datasets. My question is posted here: http://www.statalist.org/forums/forum/general-stata-discussion/general/1354690-license-for-auto-dta

        I could also create a small test DTA for use in unit tests. Do you have a preferred way for me to transfer this file to you?

        In case it is useful in configuring proper detection, the reference spec for Stata DTA files is open and published here: http://www.stata.com/help.cgi?dta

        gagravarr Nick Burch added a comment -

        From a quick Google search, `application/x-stata-dta` seems to be what other people are using for these files.

        They look to be somewhat XML-based, so proper detection ought not to be too hard to sort out.

        Do you know the license of the test file you've referenced? And/or have the ability to create a small test DTA file that you can share with us to use in unit tests?
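Magic-based detection of the kind discussed above can be sketched in a few lines of Python. This is an illustration only, not the rule that was committed to `tika-mimetypes.xml`: it assumes newer .dta files start with the literal bytes `<stata_dta>`, and the names `sniff_stata` and `detect_file` are hypothetical.

```python
# Assumption: newer, XML-style .dta files begin with this literal tag.
# Older, purely binary .dta formats are not covered by this sketch.
STATA_XML_MAGIC = b"<stata_dta>"
STATA_MIME = "application/x-stata-dta"
FALLBACK_MIME = "application/octet-stream"


def sniff_stata(head: bytes) -> str:
    """Return the Stata MIME type if the leading bytes match, else the fallback."""
    if head.startswith(STATA_XML_MAGIC):
        return STATA_MIME
    return FALLBACK_MIME


def detect_file(path: str) -> str:
    """Sniff a file on disk by reading only as many bytes as the magic needs."""
    with open(path, "rb") as fh:
        return sniff_stata(fh.read(len(STATA_XML_MAGIC)))
```

Matching on a fixed prefix at offset 0 keeps the check cheap and avoids treating the file as generic markup just because it begins with `<`.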


          People

          • Assignee: Unassigned
          • Reporter: Michael Stepner
          • Votes: 0
          • Watchers: 3