TIKA-2300

Can't tell if a zip file is encrypted

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.14
    • Fix Version/s: 2.0, 1.15
    • Component/s: None
    • Labels:
      None

      Description

      When Tika processes a zip file that is protected with a password, it will return the list of file names within the zip but no indication (as an exception or in metadata) that the file is encrypted.

      From stepping through the code, I can see that the information needed to determine whether the archive is encrypted is available via ZipArchiveEntry#getGeneralPurposeBit().usesEncryption(), but it needs to be relayed back to PackageParser somehow.
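
The check described above can be sketched without any library. In the ZIP format, bit 0 of the general purpose bit flag in an entry's local file header marks the entry as encrypted, and that is the bit `usesEncryption()` reports. The class and method names below are illustrative and not part of Tika or Commons Compress, and the sketch only inspects the first entry of the archive.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Illustrative helper (not part of Tika): inspects the first ZIP local file header. */
final class ZipEncryptionCheck {
    // Signature that starts every local file header ("PK\003\004"), read little-endian.
    private static final int LOCAL_FILE_HEADER_SIGNATURE = 0x04034b50;
    // Offset of the two-byte general purpose bit flag within the local file header.
    private static final int FLAGS_OFFSET = 6;

    private ZipEncryptionCheck() {}

    /** Returns true if the first entry's general purpose bit 0 (encryption) is set. */
    static boolean firstEntryUsesEncryption(byte[] zipBytes) {
        if (zipBytes.length < FLAGS_OFFSET + 2) {
            throw new IllegalArgumentException("Too short to hold a local file header");
        }
        ByteBuffer buf = ByteBuffer.wrap(zipBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != LOCAL_FILE_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("Does not start with a local file header");
        }
        int flags = buf.getShort(FLAGS_OFFSET) & 0xFFFF;
        return (flags & 0x0001) != 0; // bit 0: entry is encrypted
    }
}
```

An archive with no entries starts with a different record, so this sketch rejects it rather than guessing. Inside a parser that already holds a `ZipArchiveEntry`, calling `getGeneralPurposeBit().usesEncryption()` on the entry is the more direct route.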

      1. TIKA-2300.patch
        5 kB
        Aeham Abushwashi
      2. encrypted_file.zip
        9 kB
        Aeham Abushwashi

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Thank you Aeham Abushwashi! Please let us know what else you find...esp with the other flavors of streams.

        The reporting method I used is what we're also now using in the case that an embedded stream has some other type of exception before the stream hits the embedded parser (this happened fairly often with images inside ppt, IIRC).

        Also, let us know how the RecursiveParserWrapper is working out. It breaks one of Tika's core original goals, streaming reads/streaming writes, but I think it is critical for some use cases to maintain metadata from embedded files.

        Cheers!

        aeham.abushwashi Aeham Abushwashi added a comment -

        Thank you Tim Allison!

        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x-windows #181 (See https://builds.apache.org/job/tika-2.x-windows/181/)
        TIKA-2300 record streams that can't be read via pkg's metadata via Aeham (tallison: rev 29d7d7ceb30cc74afcb35a0b9497c7e4a0668eee)

        • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
        • (edit) tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
        • (add) tika-test-resources/src/test/resources/test-documents/testZipEncrypted.zip
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build tika-2.x #229 (See https://builds.apache.org/job/tika-2.x/229/)
        TIKA-2300 record streams that can't be read via pkg's metadata via Aeham (tallison: rev 29d7d7ceb30cc74afcb35a0b9497c7e4a0668eee)

        • (edit) tika-parser-modules/tika-parser-package-module/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
        • (edit) tika-parser-modules/tika-parser-package-module/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
        • (add) tika-test-resources/src/test/resources/test-documents/testZipEncrypted.zip
        hudson Hudson added a comment -

        SUCCESS: Integrated in Jenkins build Tika-trunk #1224 (See https://builds.apache.org/job/Tika-trunk/1224/)
        TIKA-2300 include exception for streams that can't be read in pkg parser (tallison: https://github.com/apache/tika/commit/f55b87fc5cb91a9e9ee067e3c1c68dff691d48a5)

        • (add) tika-parsers/src/test/resources/test-documents/testZipEncrypted.zip
        • (edit) tika-parsers/src/test/java/org/apache/tika/parser/pkg/ZipParserTest.java
        • (edit) tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java
        tallison@mitre.org Tim Allison added a comment -

        Aeham Abushwashi I added a stacktrace to the parent's metadata object for encrypted files, or for any stream that can't be read.

        Going forward, it will be helpful to figure out how to determine a) encryption (for non-zip) files or b) other reasons that a stream might not be readable.

        For now, the embedded exception is an encryption exception for your test stream and a general TikaException for any other reason that a package stream can't be read.

        Thank you for the issue and the patch!
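
The reporting approach described in this comment can be sketched roughly as follows: when an embedded stream cannot be read, the failure is turned into an exception (a specific one for encryption, a general one otherwise) and its stack trace is stored in the parent's metadata instead of aborting the parse. Everything named here is a hypothetical stand-in: a plain `Map` plays the role of the parent's metadata object, the key name is invented, and the two exception classes stand in for the encryption exception and `TikaException`.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; none of these names are Tika's actual API. */
final class EmbeddedStreamErrors {
    // Hypothetical metadata key under which stack traces are collected.
    static final String EMBEDDED_STREAM_EXCEPTION = "embedded_stream_exception";

    /** Stand-in for a general "stream could not be read" exception. */
    static class UnreadableStreamException extends Exception {
        UnreadableStreamException(String message) { super(message); }
    }

    /** Stand-in for the more specific "stream is encrypted" exception. */
    static class EncryptedStreamException extends UnreadableStreamException {
        EncryptedStreamException(String message) { super(message); }
    }

    private EmbeddedStreamErrors() {}

    /** Picks the specific exception when encryption is known, the general one otherwise. */
    static Exception classify(boolean encrypted, String entryName) {
        return encrypted
                ? new EncryptedStreamException("Encrypted entry: " + entryName)
                : new UnreadableStreamException("Unreadable entry: " + entryName);
    }

    /** Appends the stack trace to the parent's metadata so parsing can continue. */
    static void record(Throwable t, Map<String, List<String>> parentMetadata) {
        StringWriter trace = new StringWriter();
        t.printStackTrace(new PrintWriter(trace, true));
        parentMetadata
                .computeIfAbsent(EMBEDDED_STREAM_EXCEPTION, k -> new ArrayList<>())
                .add(trace.toString());
    }
}
```

Keeping the traces as a multi-valued entry means one unreadable entry does not hide another, and callers can still tell encryption apart from other failures by the exception class name at the start of each trace.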

        tallison@mitre.org Tim Allison added a comment -

        Will take a look today. Too many spinning plates. Thank you.

        aeham.abushwashi Aeham Abushwashi added a comment -

        Thanks Tim Allison, RecursiveParserWrapper solves the embedded metadata problem nicely!
        Any thoughts on the draft patch for the encryption issue?

        tallison@mitre.org Tim Allison added a comment -

        Thank you! I'll take a look soon.

        "the change made me realise the rich metadata extracted by the PackageParser for the compressed/inner files never finds its way back up to users through the metadata object. Is this by design?"

        With classic Tika, yes, this is what happens. This is exactly why I integrated Jukka Zitting's and Nick Burch's RecursiveParserWrapper to maintain metadata of embedded files. Give it a try with "-J -t" from the command line with tika-app, or try "/rmeta" if you're using tika-server.
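
For the tika-server route, a request to "/rmeta" might be built as in the sketch below, using the JDK's built-in HTTP client (Java 11+). The base URL `http://localhost:9998` is an assumption about where a local tika-server is listening, and the class and method names are illustrative.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Illustrative sketch: builds (but does not send) a PUT request to tika-server's /rmeta. */
final class RmetaRequest {
    private RmetaRequest() {}

    static HttpRequest build(String serverBase, byte[] document) {
        return HttpRequest.newBuilder(URI.create(serverBase + "/rmeta"))
                // Ask for the JSON list of metadata objects, one per (embedded) document.
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }
}
```

Sending it is a matter of `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, with the zip file's bytes as the request body and a running server at the assumed address.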

        aeham.abushwashi Aeham Abushwashi added a comment -

        Here's a first stab at a patch for discussion....
        PackageParser can easily figure out if the zip is encrypted (albeit with an ugly cast!). I figured users may not always want the PackageParser to abandon processing encrypted zip files and opted for adding a metadata flag to indicate the file is encrypted. This maintains backwards compatibility with TIKA-1028, but is it consistent with how Tika reports partial success/failure elsewhere?
        Also... the change made me realise the rich metadata extracted by the PackageParser for the compressed/inner files never finds its way back up to users through the metadata object. Is this by design?

        aeham.abushwashi Aeham Abushwashi added a comment -

        The attached file can be used to repro. It is protected with a password, "password123", using ZipCrypto


          People

          • Assignee:
            tallison@mitre.org Tim Allison
          • Reporter:
            aeham.abushwashi Aeham Abushwashi
          • Votes:
            0
          • Watchers:
            3
