Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: None
    • Labels:
      None

      Description

      Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" files.

      Tika currently resolves the mimetype of these files as text/plain.

      The correct mimetype is text/vtt.

      See: https://w3c.github.io/webvtt/
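
      As a rough illustration of the requested behaviour, the sketch below maps a file name to a mimetype by its extension, so that a ".vtt" name yields text/vtt. It is not Tika's code: the class ExtensionMimeLookup, the method guessFromName, the two-entry table, and the application/octet-stream fallback are all assumptions made for this example.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper for illustration only; this is not part of Tika. */
final class ExtensionMimeLookup {
    // Tiny illustrative table; a real registry holds far more entries.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "vtt", "text/vtt");

    private ExtensionMimeLookup() {}

    /** Returns the mimetype registered for the name's extension, or a generic fallback. */
    static String guessFromName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "application/octet-stream"; // no usable extension
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(extension, "application/octet-stream");
    }

    public static void main(String[] args) {
        System.out.println(guessFromName("upc-video-subtitles-en.vtt"));
    }
}
```

      A lookup like this depends entirely on the file name, so it cannot help when a file arrives without the .vtt extension.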

      Attachments

      1. TikaVtt.java
        1.0 kB
        Abd
      2. upc-video-subtitles-en.vtt
        0.7 kB
        Alexander Widera

        Activity

        githubbot ASF GitHub Bot added a comment -

        GitHub user wiedsche opened a pull request:

        https://github.com/apache/tika/pull/59

        fix for TIKA-1772 contributed by wiedsche

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/wiedsche/tika TIKA-1772

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tika/pull/59.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #59


        commit 08a4df4e2b6a0d2cd14dc411906ed4a4a45814a3
        Author: Alexander Widera <widera@chemmedia.de>
        Date: 2015-10-16T07:15:56Z

        fix for TIKA-1772 contributed by wiedsche


        gagravarr Nick Burch added a comment -

        Thanks for the patch! Couple of minor points - we normally sort the mimetypes by the mimetype string, and please try to avoid changing other whitespace at the same time.

        Patch applied with tweaks in r1708940.

        Also, any chance of a small test vtt file, that we can use for detection unit testing?

        wiedsche Alexander Widera added a comment -

        Added example vtt file as attachment.

        Thanks for applying the patch. Sorry for changing the whitespace; that was the GitHub client's fault. My other client showed me those changes, but the GitHub client did not.
        Next time I will insert the mimetype in the correct order. Thanks.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #869 (See https://builds.apache.org/job/tika-trunk-jdk1.7/869/)
        TIKA-1772 WebVTT mime entry from Alexander Widera (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1708940)

        • trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        gagravarr Nick Burch added a comment - - edited

        Thanks for that. Looks like we can do mime magic detection too! I've added the file, along with magic and tests, in r1708996.

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #871 (See https://builds.apache.org/job/tika-trunk-jdk1.7/871/)
        TIKA-1772 Test WebVTT file from Alexander Widera, mime magic for it, and detection tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1708996)

        • trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
        • trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        • trunk/tika-parsers/src/test/resources/test-documents/testWebVTT.vtt
        chrismattmann Chris A. Mattmann added a comment -

        Nick, please reference the GitHub PR number in the commit message and it will close automatically.

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tika/pull/59

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #872 (See https://builds.apache.org/job/tika-trunk-jdk1.7/872/)
        Fix for TIKA-1772: Mimetype of VTT files contributed by Alexander Widera <widera@chemmedia.de> this closes #59. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1709302)

        • trunk/CHANGES.txt
        abdelsh Abd added a comment -

        The issue is still there.
        According to the spec:
        https://w3c.github.io/webvtt/#file-structure
        "A WebVTT file body consists of the following components, in the following order:
        An optional U+FEFF BYTE ORDER MARK (BOM) character.
        The string "WEBVTT"."

        So a file like this one will still return text/plain:

        WEBVTT

        00:01.000 --> 00:04.000
        Never drink liquid nitrogen.

        00:05.000 --> 00:09.000
        - It will perforate your stomach.
        - You could die.

        ref: https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
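
        The file structure quoted above suggests a simple content check: skip an optional byte order mark, then require the string "WEBVTT". The sketch below is one way to express that in Java. It is not Tika's magic definition; the class WebVttSniffer and the method looksLikeWebVtt are hypothetical names, and the sketch assumes UTF-8 input. It also requires a space, tab, line terminator, or end of input after the signature, which goes slightly beyond the excerpt quoted here.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sniffer for illustration only; this is not Tika's implementation. */
final class WebVttSniffer {
    // UTF-8 encoding of U+FEFF BYTE ORDER MARK.
    private static final byte[] BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] SIGNATURE = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    private WebVttSniffer() {}

    /** Returns true if the leading bytes of a file look like the start of a WebVTT file. */
    static boolean looksLikeWebVtt(byte[] head) {
        int pos = 0;
        // Step 1: skip the optional byte order mark.
        if (head.length >= BOM.length
                && head[0] == BOM[0] && head[1] == BOM[1] && head[2] == BOM[2]) {
            pos = BOM.length;
        }
        // Step 2: the string "WEBVTT" must follow immediately.
        if (head.length - pos < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (head[pos + i] != SIGNATURE[i]) {
                return false;
            }
        }
        pos += SIGNATURE.length;
        // Step 3 (assumption of this sketch): the signature ends the input or is
        // followed by a space, a tab, or a line terminator.
        if (pos == head.length) {
            return true;
        }
        byte next = head[pos];
        return next == ' ' || next == '\t' || next == '\n' || next == '\r';
    }

    public static void main(String[] args) {
        byte[] sample = "WEBVTT\n\n00:01.000 --> 00:04.000\nNever drink liquid nitrogen.\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(looksLikeWebVtt(sample) ? "text/vtt" : "text/plain");
    }
}
```

        Because the check looks only at the leading bytes, it does not depend on the file name at all.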

        gagravarr Nick Burch added a comment -

        Any chance you could generate one such vtt file which Tika can't currently correctly detect, and upload it to this issue? We can then use that to work on a fix, as well as to add a unit test to ensure it stays fixed.

        abdelsh Abd added a comment -

        Please find the attached file TikaVtt.java.
        I forgot to mention that detection does work if the file has the .vtt extension.

        gagravarr Nick Burch added a comment -

        Thanks for the test file! I've committed it, along with a similar version, and a modified version of your unit test. Following some additional magic entries inspired by reading the specs (thanks again for that link!), your file can now be correctly detected even without the filename.

        If you find any more VTT files we can't detect properly, please raise a new bug / re-open this one, and upload the problematic file so we can look further!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1227 (See https://builds.apache.org/job/Tika-trunk/1227/)
        TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: https://github.com/apache/tika/commit/bb82205eece0eb68edee7d7ac24f63cf3934198f)

        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: https://github.com/apache/tika/commit/3c02c4b2abf10ce9745734d8eff2b7a1f5bf1765)
        • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
        • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
          TIKA-1772 More WebVTT unit tests (nick: https://github.com/apache/tika/commit/40647ea4e929683ae41422bdd3144cf84f24d0e0)
        • (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #232 (See https://builds.apache.org/job/tika-2.x/232/)
        TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: rev 2df5c536bd3a37660dab5914a385097ed1f39560)

        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: rev e34498bbefc87038511e3077a80cf71d8c8fc98b)
        • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
        • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
          TIKA-1772 More WebVTT unit tests (nick: rev 78c31eb614231d8f75f87a09fa0d3eef9e3010ba)
        • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x-windows #184 (See https://builds.apache.org/job/tika-2.x-windows/184/)
        TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: rev 2df5c536bd3a37660dab5914a385097ed1f39560)

        • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: rev e34498bbefc87038511e3077a80cf71d8c8fc98b)
        • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
        • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
          TIKA-1772 More WebVTT unit tests (nick: rev 78c31eb614231d8f75f87a09fa0d3eef9e3010ba)
        • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java

          People

          • Assignee:
            Unassigned
            Reporter:
            wiedsche Alexander Widera
          • Votes:
            0
            Watchers:
            6
