Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.14
    • Fix Version/s: None
    • Component/s: mime
    • Labels: None

      Description

      As raised at http://stackoverflow.com/questions/41272195/onenote-support-for-apache-tika-parsers, we don't have any magic for the OneNote formats. Several years ago we dug out the file format specs (see http://lucene.472066.n3.nabble.com/Tika-OneNote-Support-td4020393.html), but didn't have the volunteer energy to implement a parser. However, armed with those specs, we should be able to come up with some mime magic for detection.

      1. note-ssn-test-mmmm.one
        30 kB
        Krishnan Narayan
      2. Sample1.one
        352 kB
        Krishnan Narayan

        Activity

        gagravarr Nick Burch added a comment -

        Mime magic now added for `.one` and `.onetoc`. `.onepkg` is actually just a CAB file of other OneNote files, so we can't add magic for it (it needs detecting by opening the container).

        No unit tests yet, leaving open until we get some small sample files we can use, hopefully from the original poster on StackOverflow!
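
        For illustration only, here is a rough sketch of what header-based detection for these formats could look like. It is not the change made to tika-mimetypes.xml. The class and method names are hypothetical, and the two 16-byte signatures are the file-type GUIDs as recalled from the published OneNote file format spec, so they should be checked against the spec before being relied on. `.onepkg` can only be reported as a CAB candidate at this level, since confirming it would mean opening the container.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika code: classify a file by its leading bytes.
final class OneNoteMagicSketch {

    // ASSUMPTION: file-type GUID for .one, in on-disk byte order
    // ({7B5C52E4-D88C-4DA7-AEB1-5378D02996D3}); verify against the spec.
    static final byte[] ONE_MAGIC = hex("E4525C7B8CD8A74DAEB15378D02996D3");

    // ASSUMPTION: file-type GUID for .onetoc2, in on-disk byte order
    // ({43FF2FA1-EFD9-4C76-9EE2-10EA5722765F}); verify against the spec.
    static final byte[] ONETOC2_MAGIC = hex("A12FFF43D9EF764C9EE210EA5722765F");

    // Microsoft Cabinet files start with the ASCII bytes "MSCF".
    static final byte[] CAB_MAGIC = "MSCF".getBytes(StandardCharsets.US_ASCII);

    // Decode an even-length hex string into bytes.
    static byte[] hex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // True if data begins with every byte of magic.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns an illustrative label (not a registered mime type).
    static String classify(byte[] header) {
        if (startsWith(header, ONE_MAGIC)) {
            return "one";
        }
        if (startsWith(header, ONETOC2_MAGIC)) {
            return "onetoc2";
        }
        if (startsWith(header, CAB_MAGIC)) {
            // Could be a .onepkg, but only inspecting the CAB contents can tell.
            return "cab";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] sample = "MSCF\0\0\0\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(classify(sample));
    }
}
```

        Matching on the leading bytes is cheap and works on a stream prefix, which is why it suits `.one` and `.onetoc2` but not `.onepkg`, where the outer bytes only say "CAB".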

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1167 (See https://builds.apache.org/job/Tika-trunk/1167/)
        TIKA-2224 Mime magic for OneNote (nick: rev df14f78e46feeae16cb6cbd2cb40c44ce497d53e)
        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote (nick: rev 009c143aedb95e356b2835b17fd66e9f7aec43d0)
        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        TIKA-2224 We now differ from HTTPD on onenote formats, as we have (nick: rev 9546bd31953a10704e54fd40ebac68b2138e3aa2)
        • (edit) tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java
        keshy Krishnan Narayan added a comment -

        Please find sample file attached.

        tallison@mitre.org Tim Allison added a comment - edited

        Looks like we only have one OneNote file in our regression corpus, and it is truncated/corrupt.

        keshy Krishnan Narayan added a comment -

        Another sample.

        gagravarr Nick Burch added a comment -

        Thanks for the test file. I've added it to git and created a unit test using it.

        If we could find a small test .onetoc2 file and a small test .onepkg file as well, that'd be great!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1169 (See https://builds.apache.org/job/Tika-trunk/1169/)
        TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test (nick: rev ef1d9077bc9312e796bc00b6a00433cdd5f23c2f)

        • (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        • (add) tika-parsers/src/test/resources/test-documents/testOneNote.one
        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build tika-2.x-windows #90 (See https://builds.apache.org/job/tika-2.x-windows/90/)
        TIKA-2224 Mime magic for OneNote (nick: rev bb76d986a81e720e8f2992d8e07eb1e462337ab7)
        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote (nick: rev db21ee158a0a576c8012c8424ba1424626adc77a)
        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        TIKA-2224 We now differ from HTTPD on onenote formats, as we have (nick: rev 71584b2deb5eda2742fb362b279740e8b6fed15d)
        • (edit) tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java

        TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test (nick: rev cdb6456bbf1317e20f1fd11b2a9bcd1fc2282b2d)
        • (add) tika-parsers/src/test/resources/test-documents/testOneNote.one
        • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #189 (See https://builds.apache.org/job/tika-2.x/189/)
        TIKA-2224 Mime magic for OneNote (nick: rev bb76d986a81e720e8f2992d8e07eb1e462337ab7)
        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        TIKA-2224 Mime sub-entry for .onepkg, a cab file holding other onenote (nick: rev db21ee158a0a576c8012c8424ba1424626adc77a)
        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml

        TIKA-2224 We now differ from HTTPD on onenote formats, as we have (nick: rev 71584b2deb5eda2742fb362b279740e8b6fed15d)
        • (edit) tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java

        TIKA-2224 Test OneNote file from Krishnan Narayan plus unit test (nick: rev cdb6456bbf1317e20f1fd11b2a9bcd1fc2282b2d)
        • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        • (add) tika-parsers/src/test/resources/test-documents/testOneNote.one
        keshy Krishnan Narayan added a comment -

        Hi Nick

        How can I look at the change? It seems I don't see these commits on GitHub. Perhaps the mirror there does not have the latest commits? Can you point me to the resource where I can take a look at the changes?

        I am trying to find you more samples.

        Thanks
        Krishnan

        gagravarr Nick Burch added a comment -

        They very much are on github! See https://github.com/apache/tika/commit/df14f78e46feeae16cb6cbd2cb40c44ce497d53e for example

          People

          • Assignee: Unassigned
          • Reporter: gagravarr Nick Burch
          • Votes: 1
          • Watchers: 5