Description
I have a number of (relatively simple) XPS documents which Tika fails to process. The following exception appears:
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4149c063
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
    at com.mcms.Main.parseFile(Main.java:88)
    at com.mcms.Main.main(Main.java:59)
Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature data descriptor used in entry Documents/1/Metadata/Page1_Thumbnail.JPG
    at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:477)
    at java.base/java.io.FilterInputStream.read(Unknown Source)
    at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:80)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:182)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
    at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:136)
    at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
    at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
    at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
    at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 5 more
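Apparently the generator of these files (the XPS printer driver, printing from Notepad) adds a per-page thumbnail image, and the ZIP entry for that thumbnail uses a data descriptor, which the ZIP stream reader rejects as an unsupported feature.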
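To check whether a given XPS file is affected, the ZIP central directory can be inspected for entries that use a data descriptor. The following is only a minimal diagnostic sketch, not anything taken from Tika or POI: the class and method names are hypothetical, it assumes the problematic entries are STORED (compression method 0) with general-purpose flag bit 3 set, and it does not handle Zip64 archives or multi-disk archives.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists ZIP entries that are STORED and use a data descriptor.
public class DataDescriptorScan {

    private static final int EOCD_SIG = 0x06054b50;     // end of central directory record
    private static final int CENTRAL_SIG = 0x02014b50;  // central directory file header

    public static List<String> storedWithDataDescriptor(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);

        // The end-of-central-directory record is at least 22 bytes long; search backwards for it.
        int eocd = -1;
        for (int i = zip.length - 22; i >= 0; i--) {
            if (buf.getInt(i) == EOCD_SIG) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("No end-of-central-directory record found");
        }

        int entryCount = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);                   // offset of the central directory

        List<String> hits = new ArrayList<>();
        for (int n = 0; n < entryCount; n++) {
            if (buf.getInt(pos) != CENTRAL_SIG) {
                throw new IllegalArgumentException("Bad central directory header at offset " + pos);
            }
            int flags = buf.getShort(pos + 8) & 0xFFFF;    // general-purpose bit flags
            int method = buf.getShort(pos + 10) & 0xFFFF;  // 0 = STORED, 8 = DEFLATED
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            String name = new String(zip, pos + 46, nameLen, StandardCharsets.UTF_8);

            // Bit 3 means sizes and CRC follow the entry data in a data descriptor.
            if (method == 0 && (flags & 0x08) != 0) {
                hits.add(name);
            }
            pos += 46 + nameLen + extraLen + commentLen;
        }
        return hits;
    }
}
```

Calling `DataDescriptorScan.storedWithDataDescriptor(java.nio.file.Files.readAllBytes(path))` on one of the failing documents should list the thumbnail entry if the assumption above holds; an empty list would suggest the failure has a different cause.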
Attachments
Issue Links
- is related to TIKA-3196: PackageParser should attempt to parse entries from zip files with STORED entries with data descriptor (Resolved)