Details
Type: Improvement
Status: Open
Priority: Minor
Resolution: Unresolved
Description
We're getting around 300 exceptions from "application/x-tika-msoffice" files in our current slice of Common Crawl documents that look roughly like this:
java.lang.IllegalArgumentException: Position 86528 past the end of the file
    at org.apache.poi.poifs.nio.FileBackedDataSource.read
I suspect these are non-Microsoft OLE file formats. Any help identifying the file types and patching our OLE MIME detector would be great.
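One way to start triaging the failing files, sketched here as an assumption rather than as anything Tika or POI currently does: read the OLE2 header directly and check whether the first directory sector it points to lies inside the file. The reported position, 86528, is exactly 169 × 512, which would be consistent with a read at a 512-byte sector boundary beyond the end of a truncated file, though that is only a guess about the cause. The class name `OleHeaderTriage`, the method `classify`, and the result labels are hypothetical and used only for illustration. The header offsets (sector shift at byte 30, first directory sector at byte 48) follow the published Compound File Binary layout.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical triage helper; not part of Tika or POI.
final class OleHeaderTriage {
    // OLE2 / Compound File Binary signature (first 8 bytes of the file).
    private static final byte[] MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final int HEADER_SIZE = 512;

    // Returns a coarse label describing what the header says about the file.
    static String classify(byte[] data) {
        if (data.length < MAGIC.length) {
            return "NOT_OLE2";
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (data[i] != MAGIC[i]) {
                return "NOT_OLE2";
            }
        }
        if (data.length < HEADER_SIZE) {
            return "TRUNCATED_HEADER";
        }
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        // Sector shift: 9 means 512-byte sectors, 12 means 4096-byte sectors.
        int sectorShift = buf.getShort(30) & 0xFFFF;
        if (sectorShift != 9 && sectorShift != 12) {
            return "BAD_SECTOR_SHIFT";
        }
        long sectorSize = 1L << sectorShift;
        // Unsigned 32-bit index of the first directory sector.
        long firstDirSector = buf.getInt(48) & 0xFFFFFFFFL;
        // Sector n starts at (n + 1) * sectorSize, so it ends at (n + 2) * sectorSize.
        long directoryEnd = (firstDirSector + 2) * sectorSize;
        if (directoryEnd > data.length) {
            return "DIRECTORY_PAST_EOF";
        }
        return "HEADER_OK";
    }

    public static void main(String[] args) throws IOException {
        // Prints one tab-separated line per file: path, then label.
        for (String arg : args) {
            byte[] data = Files.readAllBytes(Paths.get(arg));
            System.out.println(arg + "\t" + classify(data));
        }
    }
}
```

Running this over the roughly 300 failing files and grouping by label might separate truncated or corrupt OLE2 containers from files that merely share the signature. A `HEADER_OK` result does not rule out an out-of-range sector elsewhere in the FAT chain; this only checks the first directory sector.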
Attachments
Issue Links
- relates to TIKA-2632 Analyze unknown govdocs files (Open)