Tika / TIKA-2564

Tika client cannot extract files from embedded archive formats


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: None
    • Labels: None

    Description

       

      This may be related to TIKA-2395. When I try to extract the files from tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz with:

      % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz

      I see this exception:

       

      Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78

      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)

      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

      at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)

      at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)

      at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)

      at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)

      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

      at java.base/java.lang.reflect.Method.invoke(Method.java:564)

      at coursier.cli.qR.a(Unknown Source)

      at coursier.cli.qQ.j(Unknown Source)

      at coursier.cli.qW.a(Unknown Source)

      at d.h.a.c(Unknown Source)

      at b.b.c_(Unknown Source)

      at d.b.d.E.g(Unknown Source)

      at d.b.e.aW.g(Unknown Source)

      at d.b.f.b.aa.a(Unknown Source)

      at coursier.cli.qQ.b(Unknown Source)

      at coursier.cli.Q.b(Unknown Source)

      at b.J.c_(Unknown Source)

      at d.F.h(Unknown Source)

      at b.F.a(Unknown Source)

      at coursier.cli.Coursier.main(Unknown Source)

      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

      at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

      at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

      at java.base/java.lang.reflect.Method.invoke(Method.java:564)

      at coursier.Bootstrap.main(Bootstrap.java:428)

      Caused by: java.io.IOException: mark/reset not supported

      at java.base/java.io.InputStream.reset(InputStream.java:474)

      at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)

      at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)

      at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)

      at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)

      at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)

      ... 28 more

       

      However, I can browse the document fine using:

       

      % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- test-documents.tgz

       

      This issue affects: test-documents.rar, test-documents.tar.Z, test-documents.tbz2, and test-documents.tgz

      But it does not affect test-documents.7z, test-documents.cab, test-documents.ddf, test-documents.dmg, test-documents.tar, or test-documents.zip

       

       

      This makes me suspect that the problem is related to extracting files from packages that are themselves embedded in another archive format.
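      The "Caused by" frames in the trace show InputStream.reset() failing inside a detector. The following is a minimal sketch of that failure mode using only the JDK, not Tika's actual code: the class name MarkResetDemo and the method peekAndReset are hypothetical, and GZIPInputStream merely stands in for whatever stream the extractor hands to the detector. It assumes the detector peeks at the first bytes and then rewinds, which only works on a stream that supports mark/reset.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it does not use or reproduce any Tika code.
class MarkResetDemo {

    // Build a small gzip payload in memory so the demo needs no files.
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.toByteArray();
    }

    // Mimics a detector: mark, peek at the first bytes, then rewind.
    // Returns "ok" on success, or the IOException message on failure.
    static String peekAndReset(boolean buffered) {
        try {
            byte[] payload = gzip("hello tika".getBytes(StandardCharsets.UTF_8));
            InputStream in = new GZIPInputStream(new ByteArrayInputStream(payload));
            if (buffered) {
                // BufferedInputStream adds mark/reset support on top of any stream.
                in = new BufferedInputStream(in);
            }
            in.mark(8);
            in.read(new byte[8]);
            in.reset();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println("unbuffered: " + peekAndReset(false));
        System.out.println("buffered:   " + peekAndReset(true));
    }
}
```

      In this sketch the unbuffered call fails with the same "mark/reset not supported" message seen in the trace, while the buffered call succeeds. Whether the actual fix in Tika wraps the stream this way is not stated in this report.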

       


          People

            Assignee: Tim Allison (tallison)
            Reporter: Marc Prud'hommeaux (mprudhom)