  1. Nutch
  2. NUTCH-497

Extremely nested tags cause StackOverflowError in DOMContentUtils (spider trap)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8.1, 0.9.0, 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels:
      None
    • Environment:

      all

      Description

      Some web pages contain a form of spider trap: tags nested thousands of layers deep. When extracting outlinks, DOMContentUtils walks the parsed HTML with a recursive method, so each level of nesting adds a stack frame. At this depth the call stack is exhausted and the parse fails with a StackOverflowError.
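
      The usual way to avoid this failure mode is to replace the recursive walk with an explicit stack, so traversal depth is limited by heap rather than by the thread's call stack. The following is a minimal sketch of that idea, not the code in the attached patches. The names `IterativeOutlinkWalk`, `collectHrefs`, and `buildNested` are hypothetical and exist only for illustration; the sketch assumes nothing beyond the standard JDK DOM APIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration class; not part of Nutch.
public class IterativeOutlinkWalk {

    /** Collects href values of <a> elements using an explicit stack, no recursion. */
    public static List<String> collectHrefs(Node root) {
        List<String> hrefs = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        stack.push(root);
        while (!stack.isEmpty()) {
            Node node = stack.pop();
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && "a".equalsIgnoreCase(node.getNodeName())) {
                Element anchor = (Element) node;
                if (anchor.hasAttribute("href")) {
                    hrefs.add(anchor.getAttribute("href"));
                }
            }
            // Push children last-to-first so they are popped in document order.
            for (Node c = node.getLastChild(); c != null; c = c.getPreviousSibling()) {
                stack.push(c);
            }
        }
        return hrefs;
    }

    /** Builds a synthetic trap: `depth` nested <div> elements with one link at the bottom. */
    public static Document buildNested(int depth) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element current = doc.createElement("div");
        doc.appendChild(current);
        for (int i = 1; i < depth; i++) {
            Element child = doc.createElement("div");
            current.appendChild(child);
            current = child;
        }
        Element anchor = doc.createElement("a");
        anchor.setAttribute("href", "http://example.com/");
        current.appendChild(anchor);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // 20,000 levels is an arbitrary depth chosen to mimic the trap described above.
        Document doc = buildNested(20_000);
        System.out.println(collectHrefs(doc));
    }
}
```

      Because the pending nodes live in an `ArrayDeque` on the heap, nesting depth no longer translates into call-stack depth. A depth cap could be added on top of this if deeply nested pages should be rejected outright, but that is a separate policy decision.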

        Attachments

        1. nested-tags-trap3.patch
          44 kB
          Dennis Kubes
        2. nested-tags-trap2.patch
          30 kB
          Dennis Kubes
        3. nested-tags-trap.patch
          30 kB
          Dennis Kubes
        4. ExtremeNestedTags.patch
          1 kB
          Dennis Kubes

              People

              • Assignee:
                Dennis Kubes
              • Reporter:
                Dennis Kubes
