Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • None
    • 1.11
    • None
    • None

    Description

      Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" files.

      The mimetype resolved by tika is currently text/plain.

      The correct mimetype should be text/vtt.

      see: https://w3c.github.io/webvtt/
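
      The request above amounts to adding a name-based rule that maps the "vtt" extension to text/vtt. A minimal sketch of that kind of lookup follows; it is not Tika's code, and the class name, method name, table contents, and fallback type are all hypothetical (Tika keeps its real rules in tika-mimetypes.xml).

```java
import java.util.Locale;
import java.util.Map;

public class ExtensionMimeLookup {
    // Hypothetical lookup table for illustration only; not Tika's real rule set.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "vtt", "text/vtt");

    // Assumed fallback for names with no known extension.
    private static final String UNKNOWN = "application/octet-stream";

    /** Returns a MIME type chosen only from the file name's extension. */
    public static String detectByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return UNKNOWN; // no extension present
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, UNKNOWN);
    }

    public static void main(String[] args) {
        System.out.println(detectByName("upc-video-subtitles-en.vtt"));
    }
}
```

      A name-only rule like this cannot help when the file name is missing or carries a different extension, which is why content-based detection is a useful complement.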

      Attachments

        1. TikaVtt.java
          1.0 kB
          Abd
        2. upc-video-subtitles-en.vtt
          0.7 kB
          Alexander Widera

        Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user wiedsche opened a pull request:

          https://github.com/apache/tika/pull/59

          fix for TIKA-1772 contributed by wiedsche

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/wiedsche/tika TIKA-1772

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/59.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #59


          commit 08a4df4e2b6a0d2cd14dc411906ed4a4a45814a3
          Author: Alexander Widera <widera@chemmedia.de>
          Date: 2015-10-16T07:15:56Z

          fix for TIKA-1772 contributed by wiedsche


          nick Nick Burch added a comment -

          Thanks for the patch! Couple of minor points - we normally sort the mimetypes by the mimetype string, and please try to avoid changing other whitespace at the same time.

          Patch applied with tweaks in r1708940.

          Also, any chance of a small test vtt file, that we can use for detection unit testing?


          wiedsche Alexander Widera added a comment -

          Added an example vtt file as an attachment.

          Thanks for applying the patch. Sorry for the whitespace changes; that was the GitHub client's fault. My other client showed me the changes, but the GitHub client did not.
          Next time I will insert the mimetype in the correct order. Thanks.

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #869 (See https://builds.apache.org/job/tika-trunk-jdk1.7/869/)
          TIKA-1772 WebVTT mime entry from Alexander Widera (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1708940)

          • trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          nick Nick Burch added a comment - - edited

          Thanks for that. Looks like we can also do mime magic detection too! I've added the file, along with magic and tests, in r1708996.

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #871 (See https://builds.apache.org/job/tika-trunk-jdk1.7/871/)
          TIKA-1772 Test WebVTT file from Alexander Widera, mime magic for it, and detection tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1708996)

          • trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          • trunk/tika-parsers/src/test/resources/test-documents/testWebVTT.vtt

          chrismattmann Chris A. Mattmann added a comment -

          Nick, please reference the GitHub PR # and it will close automatically.

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/59

          hudson Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #872 (See https://builds.apache.org/job/tika-trunk-jdk1.7/872/)
          Fix for TIKA-1772: Mimetype of VTT files contributed by Alexander Widera <widera@chemmedia.de> this closes #59. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1709302)

          • trunk/CHANGES.txt
          abdelsh Abd added a comment -

          The issue is still there.
          According to the spec:
          https://w3c.github.io/webvtt/#file-structure
          "A WebVTT file body consists of the following components, in the following order:
          An optional U+FEFF BYTE ORDER MARK (BOM) character.
          The string "WEBVTT"."

          So a file like the following is still detected as text/plain:

          WEBVTT

          00:01.000 --> 00:04.000
          Never drink liquid nitrogen.

          00:05.000 --> 00:09.000
          - It will perforate your stomach.
          - You could die.

          ref: https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
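
          The file-structure rule quoted above can be turned into a small content check. The sketch below is only an illustration written from that quote, not the magic rules Tika actually ships: the class and method names are hypothetical, it assumes UTF-8 input, and the requirement that "WEBVTT" be followed by a space, tab, line terminator, or end of input is taken from the rest of the spec's file-structure section rather than from the excerpt.

```java
import java.nio.charset.StandardCharsets;

public class WebVttSniffer {
    private static final byte[] SIGNATURE = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the leading bytes look like the start of a WebVTT file. */
    public static boolean looksLikeWebVtt(byte[] head) {
        int pos = 0;
        // Skip an optional UTF-8 encoded U+FEFF byte order mark (EF BB BF).
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            pos = 3;
        }
        // The string "WEBVTT" must come next.
        if (head.length - pos < SIGNATURE.length) {
            return false;
        }
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (head[pos + i] != SIGNATURE[i]) {
                return false;
            }
        }
        pos += SIGNATURE.length;
        // Accept end of input, or a space, tab, or line terminator after the signature.
        if (pos == head.length) {
            return true;
        }
        byte next = head[pos];
        return next == ' ' || next == '\t' || next == '\n' || next == '\r';
    }

    /** Hypothetical helper: falls back to text/plain when the signature is absent. */
    public static String detect(byte[] head) {
        return looksLikeWebVtt(head) ? "text/vtt" : "text/plain";
    }

    public static void main(String[] args) {
        byte[] sample = "WEBVTT\n\n00:01.000 --> 00:04.000\nNever drink liquid nitrogen.\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(sample));
    }
}
```

          Because the check looks only at the leading bytes, it does not depend on the file name at all.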

          nick Nick Burch added a comment -

          Any chance you could generate one such vtt file which Tika can't currently correctly detect, and upload it to this issue? We can then use that to work on a fix, as well as to add a unit test to ensure it stays fixed.

          abdelsh Abd added a comment -

          Please find the attached file TikaVtt.java.
          I forgot to mention that detection does work if the file has the .vtt extension.

          nick Nick Burch added a comment -

          Thanks for the test file! I've committed it, along with a similar version, and a modified version of your unit test. Following some additional magic entries inspired by reading the specs (thanks again for that link!), your file can now be correctly detected even without the filename.

          If you find any more VTT files we can't detect properly, please raise a new bug / re-open this one, and upload the problematic file so we can look further!

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Tika-trunk #1227 (See https://builds.apache.org/job/Tika-trunk/1227/)
          TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: https://github.com/apache/tika/commit/bb82205eece0eb68edee7d7ac24f63cf3934198f)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
            TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: https://github.com/apache/tika/commit/3c02c4b2abf10ce9745734d8eff2b7a1f5bf1765)
          • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
          • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
            TIKA-1772 More WebVTT unit tests (nick: https://github.com/apache/tika/commit/40647ea4e929683ae41422bdd3144cf84f24d0e0)
          • (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x #232 (See https://builds.apache.org/job/tika-2.x/232/)
          TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: rev 2df5c536bd3a37660dab5914a385097ed1f39560)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
            TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: rev e34498bbefc87038511e3077a80cf71d8c8fc98b)
          • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
          • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
            TIKA-1772 More WebVTT unit tests (nick: rev 78c31eb614231d8f75f87a09fa0d3eef9e3010ba)
          • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build tika-2.x-windows #184 (See https://builds.apache.org/job/tika-2.x-windows/184/)
          TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: rev 2df5c536bd3a37660dab5914a385097ed1f39560)

          • (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
            TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: rev e34498bbefc87038511e3077a80cf71d8c8fc98b)
          • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
          • (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
            TIKA-1772 More WebVTT unit tests (nick: rev 78c31eb614231d8f75f87a09fa0d3eef9e3010ba)
          • (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java

          People

            Assignee: Unassigned
            Reporter: Alexander Widera (wiedsche)
            Votes: 0
            Watchers: 6
