Tika
  1. Tika
  2. TIKA-741

"Zip bomb" (XML nesting) detection is too strict

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 0.10
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      I get "zip bomb" errors from many HTML documents, e.g. http://www.akhbaar.org/wesima_articles/index-20100101-82736.html

      Is there a way that the element nesting level could be made configurable? 30 elements just doesn't seem to be enough.

      Thanks!

        Activity

        Hide
        Pascal Essiembre added a comment -

        Got it, thanks!

        Show
        Pascal Essiembre added a comment - Got it, thanks!
        Hide
        Tim Allison added a comment -

        If the modification is really PDFBox 2.0.0 specific (how to load a file with the new memory config, etc), my personal branch. For more general additions (e.g. XFA extraction), please aim at trunk on Apache's git repo and I'll merge those with my PDFBox 2.0.0 branch after I commit to our trunk.

        Show
        Tim Allison added a comment - If the modification is really PDFBox 2.0.0 specific (how to load a file with the new memory config, etc), my personal branch. For more general additions (e.g. XFA extraction), please aim at trunk on Apache's git repo and I'll merge those with my PDFBox 2.0.0 branch after I commit to our trunk.
        Hide
        Pascal Essiembre added a comment -

        So the best way to submit PDFBox 2.0.0 related patches/updates is to your own dev branch and you'll be merging it with 1.13-SNAPSHOT ultimately, or should I go directly against 1.13-SNAPSHOT?

        Show
        Pascal Essiembre added a comment - So the best way to submit PDFBox 2.0.0 related patches/updates is to your own dev branch and you'll be merging it with 1.13-SNAPSHOT ultimately, or should I go directly against 1.13-SNAPSHOT?
        Hide
        Pascal Essiembre added a comment -

        Awesome, thanks!

        Show
        Pascal Essiembre added a comment - Awesome, thanks!
        Hide
        Tim Allison added a comment -
                <dependency>
                    <groupId>org.apache.tika</groupId>
                    <artifactId>tika-core</artifactId>
                    <version>1.13-SNAPSHOT</version>
                </dependency>
        .... other tika artifactIds...
            </dependencies>
            <repositories>
                <repository>
                    <id>apache.snapshots</id>
                    <url>http://repository.apache.org/snapshots/</url>
                </repository>
            </repositories>
        
        Show
        Tim Allison added a comment - <dependency> <groupId>org.apache.tika</groupId> <artifactId>tika-core</artifactId> <version>1.13-SNAPSHOT</version> </dependency> .... other tika artifactIds... </dependencies> <repositories> <repository> <id>apache.snapshots</id> <url>http://repository.apache.org/snapshots/</url> </repository> </repositories>
        Hide
        Pascal Essiembre added a comment -

        Is your own dev branch found as snapshots in a public maven repo somewhere? If so I can make it the maven dependency we use for our project as opposed to the latest stable Tika release. It will make it easier to test an apply fixes on an ongoing basis.

        Show
        Pascal Essiembre added a comment - Is your own dev branch found as snapshots in a public maven repo somewhere? If so I can make it the maven dependency we use for our project as opposed to the latest stable Tika release. It will make it easier to test an apply fixes on an ongoing basis.
        Hide
        Tim Allison added a comment -

        Many thanks! I'll upload the fix on our end when I get a chance.

        Happy to help.

        Given PDFBox 2.0.0 is not out yet, are you open to upgrade Tika code base to support that version of PDFBox (replacing support for PDFBox 1.x)?

        Once 2.0 is out, y, I think we'll upgrade pretty quickly. See: TIKA-1285 and PDFBOX-3058 for our collaboration in support of 2.0 regression testing. My dev branch for the integration with Tika is on github

        like extracting XFA text. I can submit a patch for that as well if you are open.

        Yes, please!

        I also noticed that you have some wrappers around Tika more generally. Again, if there's anything that would generally help Tika, please send along. You may want to check out our RecursiveParserWrapper...looks like that has some overlapping functionality with what you're doing.

        Happy extraction! Cheers!

        Show
        Tim Allison added a comment - Many thanks! I'll upload the fix on our end when I get a chance. Happy to help. Given PDFBox 2.0.0 is not out yet, are you open to upgrade Tika code base to support that version of PDFBox (replacing support for PDFBox 1.x)? Once 2.0 is out, y, I think we'll upgrade pretty quickly. See: TIKA-1285 and PDFBOX-3058 for our collaboration in support of 2.0 regression testing. My dev branch for the integration with Tika is on github like extracting XFA text. I can submit a patch for that as well if you are open. Yes, please! I also noticed that you have some wrappers around Tika more generally. Again, if there's anything that would generally help Tika, please send along. You may want to check out our RecursiveParserWrapper...looks like that has some overlapping functionality with what you're doing. Happy extraction! Cheers!
        Hide
        Pascal Essiembre added a comment -

        What? That easy? Those two simple lines did it in my local testing! Many thanks! I'll upload the fix on our end when I get a chance.

        As far as sharing, I am all for it, but our changes to the Tika parser classes for PDF are mainly to support PDFBox 2.0.0 which resolved several PDF issues reported by some of our users (like better detecting of spaces between terms in some PDF). Given PDFBox 2.0.0 is not out yet, are you open to upgrade Tika code base to support that version of PDFBox (replacing support for PDFBox 1.x)?

        I think I added a few more things that may not be (or was not at the time) in Tika parser, like extracting XFA text. I can submit a patch for that as well if you are open.

        Thanks again.

        Show
        Pascal Essiembre added a comment - What? That easy? Those two simple lines did it in my local testing! Many thanks! I'll upload the fix on our end when I get a chance. As far as sharing, I am all for it, but our changes to the Tika parser classes for PDF are mainly to support PDFBox 2.0.0 which resolved several PDF issues reported by some of our users (like better detecting of spaces between terms in some PDF). Given PDFBox 2.0.0 is not out yet, are you open to upgrade Tika code base to support that version of PDFBox (replacing support for PDFBox 1.x)? I think I added a few more things that may not be (or was not at the time) in Tika parser, like extracting XFA text. I can submit a patch for that as well if you are open. Thanks again.
        Hide
        Tim Allison added a comment - - edited

        I'd recommend adding the following to your EnhancedPDF2XHTML:

            @Override
            protected void writeParagraphStart() throws IOException {
        +        super.writeParagraphStart();
        

        and

            @Override
            protected void writeParagraphEnd() throws IOException {
        +        super.writeParagraphEnd();
        

        Finally, if your modifications of our PDFParsers are enhancements that have general applicability, please, oh, please share them with us.

        Show
        Tim Allison added a comment - - edited I'd recommend adding the following to your EnhancedPDF2XHTML: @Override protected void writeParagraphStart() throws IOException { + super.writeParagraphStart(); and @Override protected void writeParagraphEnd() throws IOException { + super.writeParagraphEnd(); Finally, if your modifications of our PDFParsers are enhancements that have general applicability, please, oh, please share them with us.
        Hide
        Tim Allison added a comment -

        How are you replicating this with 1.11? I'm not able to replicate this with the tika-app gui or commandline. I'm also not able to replicate this as a unit test in trunk.

        In the stack trace, it looks like there are modified Tika classes: EnhancedPDFParser and EnhancedPDF2XHTML. If these modified classes are forgetting to close an entity, you'll get this exception...as I found when working with Acroforms :/

        In short, if you can help me replicate this with pure Tika, I'll be happy to take a look.

        Show
        Tim Allison added a comment - How are you replicating this with 1.11? I'm not able to replicate this with the tika-app gui or commandline. I'm also not able to replicate this as a unit test in trunk. In the stack trace, it looks like there are modified Tika classes: EnhancedPDFParser and EnhancedPDF2XHTML. If these modified classes are forgetting to close an entity, you'll get this exception...as I found when working with Acroforms :/ In short, if you can help me replicate this with pure Tika, I'll be happy to take a look.
        Hide
        Pascal Essiembre added a comment -

        It looks like maxDepth 100 is not enough. I am using Tika 1.11 and I am able to reproduce a false "zip bomb" exception with the following PDF: https://www.db.com/ir/en/download/DB_Interim_Report_1Q2015.pdf

        Without a maximum, the currentDepth for this file goes up to 120 in SecureContentHandler. Not being configurable at a higher level makes it impossible to parse this file with the current code.

        Shall I create a new ticket instead? Or may be re-open the following one to make it configurable: #TIKA-860 ? If we can't make it configurable, maybe a much higher maxDepth is a good idea (e.g. 1000)?

        FYI, this specific PDF issue was originally reported here: https://github.com/Norconex/collector-http/issues/221

        Show
        Pascal Essiembre added a comment - It looks like maxDepth 100 is not enough. I am using Tika 1.11 and I am able to reproduce a false "zip bomb" exception with the following PDF: https://www.db.com/ir/en/download/DB_Interim_Report_1Q2015.pdf Without a maximum, the currentDepth for this file goes up to 120 in SecureContentHandler. Not being configurable at a higher level makes it impossible to parse this file with the current code. Shall I create a new ticket instead? Or may be re-open the following one to make it configurable: # TIKA-860 ? If we can't make it configurable, maybe a much higher maxDepth is a good idea (e.g. 1000)? FYI, this specific PDF issue was originally reported here: https://github.com/Norconex/collector-http/issues/221
        Hide
        Erik Hetzner added a comment -

        100 levels should probably do the trick. Thanks!

        Show
        Erik Hetzner added a comment - 100 levels should probably do the trick. Thanks!
        Hide
        Jukka Zitting added a comment -

        In revision 1179254 I increased the default permitted XML nesting level to 100 and introduced a separate limit of at most 10 nested <div class="package-entry"> elements to catch excessive nesting of package formats.

        The maximum nesting limits can be set directly on on the SecureContentHandler level, but are not currently configurable if you're using the Tika facade or the AutoDetectParser class. I'd like to come up with default settings that work for all practical cases before we consider adding such low level configuration options to the higher level APIs.

        Show
        Jukka Zitting added a comment - In revision 1179254 I increased the default permitted XML nesting level to 100 and introduced a separate limit of at most 10 nested <div class="package-entry"> elements to catch excessive nesting of package formats. The maximum nesting limits can be set directly on on the SecureContentHandler level, but are not currently configurable if you're using the Tika facade or the AutoDetectParser class. I'd like to come up with default settings that work for all practical cases before we consider adding such low level configuration options to the higher level APIs.
        Hide
        Jukka Zitting added a comment -

        Updated summary since the described behaviour is arguably an error in the default setting.

        Show
        Jukka Zitting added a comment - Updated summary since the described behaviour is arguably an error in the default setting.

          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Erik Hetzner
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development