Details
Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Description
Files with extension "vtt" are "WebVTT: The Web Video Text Tracks Format" files.
The mimetype resolved by Tika is currently text/plain.
The correct mimetype should be text/vtt.
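To illustrate the request, here is a minimal sketch of name-based detection. It is not Tika's implementation: the class `ExtensionMimeLookup`, its method and its two-entry table are hypothetical, and it assumes the content has already been classified as text, so unknown names fall back to text/plain. Without a "vtt" entry in the table, a .vtt file would get that fallback.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical name-based lookup; NOT Tika's actual code. */
final class ExtensionMimeLookup {
    // Illustrative table only (assumption). In Tika the real mapping
    // is defined in tika-mimetypes.xml.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "txt", "text/plain",
            "vtt", "text/vtt");

    /** Returns the type mapped to the file extension, or text/plain if none matches. */
    static String detectByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "text/plain"; // no usable extension
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "text/plain");
    }

    public static void main(String[] args) {
        System.out.println(detectByName("upc-video-subtitles-en.vtt"));
    }
}
```

With the "vtt" entry present, the attached sample name resolves to text/vtt; removing that entry reproduces the reported text/plain result.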
Attachments
- TikaVtt.java (1.0 kB, Abd)
- upc-video-subtitles-en.vtt (0.7 kB, Alexander Widera)
Activity
Thanks for the patch! Couple of minor points - we normally sort the mimetypes by the mimetype string, and please try to avoid changing other whitespace at the same time.
Patch applied with tweaks in r1708940.
Also, any chance of a small test vtt file, that we can use for detection unit testing?
Added example vtt file as attachment.
Thanks for applying the patch. Sorry for the whitespace changes; that was the GitHub client's fault. My other client showed me those changes, but the GitHub client did not.
Next time I will insert the mimetype in the correct order. Thanks.
SUCCESS: Integrated in tika-trunk-jdk1.7 #869 (See https://builds.apache.org/job/tika-trunk-jdk1.7/869/)
TIKA-1772 WebVTT mime entry from Alexander Widera (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1708940)
- trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
Thanks for that. Looks like we can also do mime magic detection too! I've added the file, along with magic and tests, in r1708996.
SUCCESS: Integrated in tika-trunk-jdk1.7 #871 (See https://builds.apache.org/job/tika-trunk-jdk1.7/871/)
TIKA-1772 Test WebVTT file from Alexander Widera, mime magic for it, and detection tests (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1708996)
- trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
- trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
- trunk/tika-parsers/src/test/resources/test-documents/testWebVTT.vtt
Nick, please reference the GitHub PR number in the commit message and it will close automatically.
SUCCESS: Integrated in tika-trunk-jdk1.7 #872 (See https://builds.apache.org/job/tika-trunk-jdk1.7/872/)
Fix for TIKA-1772: Mimetype of VTT files contributed by Alexander Widera <widera@chemmedia.de> this closes #59. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1709302)
- trunk/CHANGES.txt
The issue is still there.
According to the spec:
https://w3c.github.io/webvtt/#file-structure
"A WebVTT file body consists of the following components, in the following order:
An optional U+FEFF BYTE ORDER MARK (BOM) character.
The string "WEBVTT"."
So a file like this one will return text/plain:
WEBVTT

00:01.000 --> 00:04.000
Never drink liquid nitrogen.

00:05.000 --> 00:09.000
- It will perforate your stomach.
- You could die.
ref: https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
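The two components quoted from the spec can be checked directly on the leading bytes of a file. The following is a minimal sketch of such content-based ("magic") detection, assuming UTF-8 input; the class `WebVttMagic` and its methods are hypothetical and are not the magic entries Tika uses.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical content sniffer; NOT Tika's actual magic definition. */
final class WebVttMagic {
    // U+FEFF encoded as UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
    private static final byte[] SIGNATURE = "WEBVTT".getBytes(StandardCharsets.US_ASCII);

    /** True if data contains prefix starting at the given offset. */
    static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (data.length - offset < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Skips an optional BOM, then looks for the "WEBVTT" string. */
    static String detect(byte[] data) {
        int offset = startsWith(data, 0, UTF8_BOM) ? UTF8_BOM.length : 0;
        return startsWith(data, offset, SIGNATURE) ? "text/vtt" : "text/plain";
    }

    public static void main(String[] args) {
        byte[] sample = "WEBVTT\n\n00:01.000 --> 00:04.000\nNever drink liquid nitrogen.\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(detect(sample));
    }
}
```

Because this check looks only at the content, it does not depend on the file name or extension.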
Any chance you could generate one such vtt file which Tika can't currently correctly detect, and upload it to this issue? We can then use that to work on a fix, as well as to add a unit test to ensure it stays fixed.
Please find the attached file TikaVtt.java.
I forgot to mention that detection does work if the file has the .vtt extension.
Thanks for the test file! I've committed it, along with a similar version, and a modified version of your unit test. Following some additional magic entries inspired by reading the specs (thanks again for that link!), your file can now be correctly detected even without the filename.
If you find any more VTT files we can't detect properly, please raise a new bug / re-open this one, and upload the problematic file so we can look further!
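For readers wondering what the header variants look like, here is a small sketch that tells a bare "WEBVTT" line apart from one carrying custom header text. It follows my reading of the WebVTT spec (after the signature comes end of input, a line terminator, or a space/tab followed by header text). The class `WebVttHeader` is hypothetical and says nothing about how Tika's magic entries are written.

```java
/** Hypothetical header check based on a reading of the WebVTT spec; NOT Tika code. */
final class WebVttHeader {
    private static final String SIGNATURE = "WEBVTT";

    private static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /** True if the text starts with "WEBVTT" followed by a valid delimiter or nothing. */
    static boolean isSignatureLine(String text) {
        String s = stripBom(text);
        if (!s.startsWith(SIGNATURE)) {
            return false;
        }
        if (s.length() == SIGNATURE.length()) {
            return true; // bare signature with nothing after it
        }
        char next = s.charAt(SIGNATURE.length());
        return next == ' ' || next == '\t' || next == '\n' || next == '\r';
    }

    /** Returns the optional header text on the first line, or "" if there is none. */
    static String headerText(String text) {
        if (!isSignatureLine(text)) {
            throw new IllegalArgumentException("not a WebVTT signature line");
        }
        String s = stripBom(text);
        int end = s.length();
        for (int i = SIGNATURE.length(); i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r') {
                end = i;
                break;
            }
        }
        return s.substring(SIGNATURE.length(), end).trim();
    }

    public static void main(String[] args) {
        System.out.println(isSignatureLine("WEBVTT\n\n00:01.000 --> 00:04.000\n"));
        System.out.println(headerText("WEBVTT custom header\n\n"));
    }
}
```

The delimiter check matters because a prefix test alone would also accept unrelated text that merely begins with those six letters.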
SUCCESS: Integrated in Jenkins build Tika-trunk #1227 (See https://builds.apache.org/job/Tika-trunk/1227/)
TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: https://github.com/apache/tika/commit/bb82205eece0eb68edee7d7ac24f63cf3934198f)
- (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: https://github.com/apache/tika/commit/3c02c4b2abf10ce9745734d8eff2b7a1f5bf1765)
- (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
- (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
TIKA-1772 More WebVTT unit tests (nick: https://github.com/apache/tika/commit/40647ea4e929683ae41422bdd3144cf84f24d0e0)
- (edit) tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
SUCCESS: Integrated in Jenkins build tika-2.x #232 (See https://builds.apache.org/job/tika-2.x/232/)
TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: rev 2df5c536bd3a37660dab5914a385097ed1f39560)
- (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: rev e34498bbefc87038511e3077a80cf71d8c8fc98b)
- (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
- (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
TIKA-1772 More WebVTT unit tests (nick: rev 78c31eb614231d8f75f87a09fa0d3eef9e3010ba)
- (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
SUCCESS: Integrated in Jenkins build tika-2.x-windows #184 (See https://builds.apache.org/job/tika-2.x-windows/184/)
TIKA-1772 More WebVTT magic - for cases with no header, and with custom (nick: rev 2df5c536bd3a37660dab5914a385097ed1f39560)
- (edit) tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
TIKA-1772 More test WebVTT files - no text header, and a custom one (nick: rev e34498bbefc87038511e3077a80cf71d8c8fc98b)
- (add) tika-parsers/src/test/resources/test-documents/testWebVTT_simple.vtt
- (add) tika-parsers/src/test/resources/test-documents/testWebVTT_header.vtt
TIKA-1772 More WebVTT unit tests (nick: rev 78c31eb614231d8f75f87a09fa0d3eef9e3010ba)
- (edit) tika-app/src/test/java/org/apache/tika/mime/TestMimeTypes.java
GitHub user wiedsche opened a pull request:
https://github.com/apache/tika/pull/59
fix for TIKA-1772 contributed by wiedsche
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/wiedsche/tika TIKA-1772
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tika/pull/59.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #59
commit 08a4df4e2b6a0d2cd14dc411906ed4a4a45814a3
Author: Alexander Widera <widera@chemmedia.de>
Date: 2015-10-16T07:15:56Z
fix for TIKA-1772 contributed by wiedsche