Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Description
In the latest Common Crawl crawl, Tika detected ~600k files as 'octet-stream', i.e. it could not identify them. It would be useful to extract those files (even if truncated) and run 'file' and 'siegfried' against them. We can then prioritize the formats that file and siegfried identify most often for addition to our mime-types.xml.
Separately, we might also want to do the same thing for `application/zip`: there are likely zip-based file types that we could do a better job on.
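As a rough illustration of the triage described above, here is a minimal Python sketch that runs the `file` utility over already-extracted files and tallies the reported types. It assumes `file` is on the PATH; the function names `detect_with_file` and `tally_types` are hypothetical, and nothing here is existing Tika tooling. A siegfried-based detector could presumably be passed in as another `detect` callable, but that is not shown.

```python
import subprocess
from collections import Counter
from pathlib import Path
from typing import Callable, Iterable


def detect_with_file(path: Path) -> str:
    """Return the MIME type reported by the `file` utility for one path."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def tally_types(paths: Iterable[Path], detect: Callable[[Path], str]) -> Counter:
    """Count how often each detected type occurs across the given paths.

    `detect` is any callable mapping a path to a type string, so other
    detectors can be swapped in without changing the tallying logic.
    """
    counts: Counter = Counter()
    for path in paths:
        try:
            counts[detect(path)] += 1
        except (subprocess.CalledProcessError, OSError):
            # Keep going on unreadable files or a failing detector,
            # but record that detection did not succeed.
            counts["<detection failed>"] += 1
    return counts
```

With the extracted files in a directory (say, a hypothetical `extracted/`), `tally_types(Path("extracted").iterdir(), detect_with_file).most_common(20)` would list the twenty most frequent types, which is the ordering we would use to prioritize additions to mime-types.xml.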
Thanks to snagel for a dump of stats on the most recent crawl.
Attachments
Issue Links
- supersedes TIKA-3995 image/x-3ds (Resolved)