TIKA-2564: Tika client cannot extract files from embedded archive formats

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.18, 2.0.0
    • Component/s: None
    • Labels: None
    • Environment:

      Description

       

      This may be related to TIKA-2395. When I try to extract the files from

      tika/tika-parsers/src/test/resources/test-documents/test-documents.tgz

      with the command:

      % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- --extract test-documents.tgz

      I see the exception:

       

      Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.CompressorParser@62628e78
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
          at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
          at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:205)
          at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:486)
          at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:564)
          at coursier.cli.qR.a(Unknown Source)
          at coursier.cli.qQ.j(Unknown Source)
          at coursier.cli.qW.a(Unknown Source)
          at d.h.a.c(Unknown Source)
          at b.b.c_(Unknown Source)
          at d.b.d.E.g(Unknown Source)
          at d.b.e.aW.g(Unknown Source)
          at d.b.f.b.aa.a(Unknown Source)
          at coursier.cli.qQ.b(Unknown Source)
          at coursier.cli.Q.b(Unknown Source)
          at b.J.c_(Unknown Source)
          at d.F.h(Unknown Source)
          at b.F.a(Unknown Source)
          at coursier.cli.Coursier.main(Unknown Source)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.base/java.lang.reflect.Method.invoke(Method.java:564)
          at coursier.Bootstrap.main(Bootstrap.java:428)
      Caused by: java.io.IOException: mark/reset not supported
          at java.base/java.io.InputStream.reset(InputStream.java:474)
          at org.apache.tika.parser.microsoft.POIFSContainerDetector.detect(POIFSContainerDetector.java:444)
          at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
          at org.apache.tika.cli.TikaCLI$FileEmbeddedDocumentExtractor.parseEmbedded(TikaCLI.java:1045)
          at org.apache.tika.parser.pkg.CompressorParser.parse(CompressorParser.java:222)
          at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
          ... 28 more

       

      However, I can browse the same document without error when I omit --extract:

      % coursier launch org.apache.tika:tika-app:1.17 --main org.apache.tika.cli.TikaCLI -- test-documents.tgz

       

      This issue affects test-documents.rar, test-documents.tar.Z, test-documents.tbz2, and test-documents.tgz.

      It does not affect test-documents.7z, test-documents.cab, test-documents.ddf, test-documents.dmg, test-documents.tar, or test-documents.zip.

       

       

      This makes me suspect that the problem has to do with extracting files from packages that are embedded inside another container, where one parser hands the embedded stream on to another.
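
      The "Caused by" section of the trace shows InputStream.reset() failing inside a detector that was handed the embedded stream. The following is a minimal sketch of that failure mode using only the JDK; it is not Tika's code. The class MarkResetSketch and the helpers nonMarkable and sniff are hypothetical names invented for illustration, and wrapping the stream in a BufferedInputStream is shown only as one common way to give a stream mark/reset support, not as the fix applied to this issue.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class MarkResetSketch {

    // Hypothetical stand-in for a decompressing stream: it inherits the
    // InputStream defaults, so markSupported() is false and reset() throws.
    public static InputStream nonMarkable(byte[] data) {
        ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    // Hypothetical detector-style helper: peek at the first n bytes, then
    // rewind so that a later parser sees the stream from the beginning.
    public static byte[] sniff(InputStream in, int n) throws IOException {
        in.mark(n);
        try {
            byte[] header = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(header, total, n - total);
                if (r < 0) {
                    break; // stream ended before n bytes were available
                }
                total += r;
            }
            return Arrays.copyOf(header, total);
        } finally {
            in.reset(); // throws IOException if the stream cannot rewind
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4, 5, 6, 7, 8};

        // Case 1: the raw stream cannot be rewound, so sniffing fails.
        try {
            sniff(nonMarkable(data), 4);
        } catch (IOException e) {
            System.out.println("sniff failed: " + e.getMessage());
        }

        // Case 2: a buffered wrapper adds mark/reset support.
        InputStream buffered = new BufferedInputStream(nonMarkable(data));
        byte[] header = sniff(buffered, 4);
        System.out.println("sniffed " + header.length + " bytes");
        System.out.println("next byte after rewind: " + buffered.read());
    }
}
```

      In this sketch the first call is expected to fail with the same "mark/reset not supported" message that appears in the trace, while the buffered stream can be sniffed and then read again from its first byte. Whether the stream passed to parseEmbedded in tika-app 1.17 behaves like nonMarkable is an assumption drawn from the trace, not something verified against the Tika source.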

       

            People

            • Assignee: Tim Allison (tallison)
            • Reporter: Marc Prud'hommeaux (mprudhom)