Tika / TIKA-3316

Illegal IOException processing XPS files


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.25
    • Fix Version/s: 1.26
    • Component/s: core
    • Labels: None

      Description

      I have a number of (relatively simple) XPS documents which Tika fails to process.  The following exception appears:

      org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@4149c063
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
              at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:159)
              at com.mcms.Main.parseFile(Main.java:88)
              at com.mcms.Main.main(Main.java:59)
      Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: Unsupported feature data descriptor used in entry Documents/1/Metadata/Page1_Thumbnail.JPG
              at org.apache.commons.compress.archivers.zip.ZipArchiveInputStream.read(ZipArchiveInputStream.java:477)
              at java.base/java.io.FilterInputStream.read(Unknown Source)
              at org.apache.poi.openxml4j.util.ZipArchiveThresholdInputStream.read(ZipArchiveThresholdInputStream.java:80)
              at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:182)
              at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:149)
              at org.apache.poi.util.IOUtils.toByteArray(IOUtils.java:136)
              at org.apache.poi.openxml4j.util.ZipArchiveFakeEntry.<init>(ZipArchiveFakeEntry.java:47)
              at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:53)
              at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:106)
              at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:307)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:111)
              at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
              at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
              ... 5 more
      

       

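      The generator for these files (the XPS printer driver, printing from Notepad) apparently adds a per-page thumbnail image. The exception points at that entry (Documents/1/Metadata/Page1_Thumbnail.JPG): it is written with a ZIP data descriptor, which the streaming ZIP reader used here rejects as an unsupported feature.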
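
      To check which entries in a given .xps file are affected, something like the following might help. It is only a minimal diagnostic sketch using the JDK alone: it reads the ZIP central directory and lists every entry whose general purpose flag has bit 3 (the data descriptor bit) set. The class and method names are hypothetical and not part of Tika, POI, or Commons Compress. It assumes a plain, non-ZIP64 archive. It also does not look at the compression method, so it may list entries that the streaming reader handles without trouble.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Tika, POI, or Commons Compress. */
final class DataDescriptorScan {

    private static final int EOCD_SIG = 0x06054b50;        // end of central directory record
    private static final int CEN_SIG = 0x02014b50;         // central directory file header
    private static final int DATA_DESCRIPTOR_BIT = 1 << 3; // general purpose flag, bit 3

    /** Returns the names of entries whose general purpose flag has bit 3 set. */
    static List<String> entriesWithDataDescriptor(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);

        // Search backwards for the end-of-central-directory record (at least 22 bytes long).
        int eocd = -1;
        for (int i = zip.length - 22; i >= 0; i--) {
            if (buf.getInt(i) == EOCD_SIG) {
                eocd = i;
                break;
            }
        }
        if (eocd < 0) {
            throw new IllegalArgumentException("no end-of-central-directory record found");
        }

        int count = buf.getShort(eocd + 10) & 0xFFFF; // total number of entries
        int pos = buf.getInt(eocd + 16);              // offset of the central directory

        List<String> flagged = new ArrayList<>();
        for (int n = 0; n < count; n++) {
            if (buf.getInt(pos) != CEN_SIG) {
                throw new IllegalArgumentException("bad central directory header at offset " + pos);
            }
            int flags = buf.getShort(pos + 8) & 0xFFFF;
            int nameLen = buf.getShort(pos + 28) & 0xFFFF;
            int extraLen = buf.getShort(pos + 30) & 0xFFFF;
            int commentLen = buf.getShort(pos + 32) & 0xFFFF;
            // Entry names are decoded as UTF-8 here; that is an assumption of this sketch.
            String name = new String(zip, pos + 46, nameLen, StandardCharsets.UTF_8);
            if ((flags & DATA_DESCRIPTOR_BIT) != 0) {
                flagged.add(name);
            }
            pos += 46 + nameLen + extraLen + commentLen; // advance to the next header
        }
        return flagged;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DataDescriptorScan.java <file.xps>");
            return;
        }
        byte[] zip = Files.readAllBytes(Paths.get(args[0]));
        for (String name : entriesWithDataDescriptor(zip)) {
            System.out.println(name);
        }
    }
}
```

      It can be run against one of the attached files, for example java DataDescriptorScan.java test1.xps. It reads the central directory at the end of the file rather than streaming from the start, so it does not go through the streaming code path that throws in the trace above.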

       

       

        Attachments

        1. Screenshot from 2021-03-12 17-00-05.png (7 kB, Tim Allison)
        2. test1.xps (43 kB, Nick Harmer)
        3. test2.xps (50 kB, Nick Harmer)
        4. test3.xps (53 kB, Nick Harmer)
        5. test4.xps (53 kB, Nick Harmer)


            People

            • Assignee: Tim Allison (tallison)
            • Reporter: Nick Harmer (harmn1)
