Tika / TIKA-741

"Zip bomb" (XML nesting) detection is too strict

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.10
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      I get "zip bomb" errors from many HTML documents, e.g. http://www.akhbaar.org/wesima_articles/index-20100101-82736.html

      Is there a way that the element nesting level could be made configurable? 30 elements just doesn't seem to be enough.

      Thanks!

        Activity

        Jukka Zitting added a comment -

        Updated summary since the described behaviour is arguably an error in the default setting.

        Jukka Zitting added a comment -

        In revision 1179254 I increased the default permitted XML nesting level to 100 and introduced a separate limit of at most 10 nested <div class="package-entry"> elements to catch excessive nesting of package formats.

        The maximum nesting limits can be set directly at the SecureContentHandler level, but they are not currently configurable if you're using the Tika facade or the AutoDetectParser class. I'd like to come up with default settings that work for all practical cases before we consider adding such low-level configuration options to the higher-level APIs.
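
        For illustration only, here is a minimal sketch of the two limits described above: a general element depth limit and a separate, tighter limit on nested <div class="package-entry"> elements. It uses only the JDK's SAX API and is not the code from revision 1179254. The names NestingLimitSketch, NestingLimitHandler, withinLimits and nested are hypothetical and are not part of Tika.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class NestingLimitSketch {

    /** Hypothetical handler (not Tika's): aborts parsing when a nesting limit is exceeded. */
    static class NestingLimitHandler extends DefaultHandler {
        private final int maxDepth;
        private final int maxPackageEntryDepth;
        private int depth = 0;
        private int packageEntryDepth = 0;
        // Remembers, for each open element, whether it counted as a package entry.
        private final Deque<Boolean> openElements = new ArrayDeque<>();

        NestingLimitHandler(int maxDepth, int maxPackageEntryDepth) {
            this.maxDepth = maxDepth;
            this.maxPackageEntryDepth = maxPackageEntryDepth;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts)
                throws SAXException {
            if (++depth > maxDepth) {
                throw new SAXException("Element nesting deeper than " + maxDepth);
            }
            boolean isPackageEntry =
                    "div".equals(qName) && "package-entry".equals(atts.getValue("class"));
            if (isPackageEntry && ++packageEntryDepth > maxPackageEntryDepth) {
                throw new SAXException("Package entries nested deeper than " + maxPackageEntryDepth);
            }
            openElements.push(isPackageEntry);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            depth--;
            if (openElements.pop()) {
                packageEntryDepth--;
            }
        }
    }

    /** Returns true if the XML parses without exceeding either limit; malformed XML also yields false. */
    public static boolean withinLimits(String xml, int maxDepth, int maxPackageEntryDepth) {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new NestingLimitHandler(maxDepth, maxPackageEntryDepth));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a document with the given number of nested elements. */
    public static String nested(String open, String close, int levels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < levels; i++) sb.append(open);
        for (int i = 0; i < levels; i++) sb.append(close);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Limits taken from the comment above: 100 elements, 10 package entries.
        System.out.println(withinLimits(nested("<a>", "</a>", 100), 100, 10));
        System.out.println(withinLimits(nested("<a>", "</a>", 101), 100, 10));
    }
}
```

        With the old limit of 30, a document such as nested("<a>", "</a>", 31) would be rejected, which matches the behaviour reported in the description.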

        Erik Hetzner added a comment -

        100 levels should probably do the trick. Thanks!


          People

          • Assignee:
            Jukka Zitting
            Reporter:
            Erik Hetzner
          • Votes:
            0
            Watchers:
            0
