  1. Nutch
  2. NUTCH-2730

SitemapProcessor to treat sitemap URLs as Set instead of List


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.18
    • Component/s: sitemap
    • Labels: None

    Description

      https://archive.epa.gov/robots.txt lists 160k sitemap URLs, which is absurd. Almost 160k of them are duplicates; there are no friendly words to describe this astonishing fact.

      Our local Nutch chews through this list in 22s, but for some weird reason the big job on Hadoop fails, though it is also working on a lot more.

      Maybe this is not a problem, maybe it is. Either way, treating the sitemap URLs as a Set rather than a List makes sense.
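
      The effect of the proposed change can be illustrated with a minimal sketch. This is not the actual SitemapProcessor code: the class name SitemapUrlDedup, the method dedupe, and the example.org URLs are hypothetical, chosen only to show how collecting the URLs into a Set collapses repeated entries while a List keeps every one of them.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not taken from the Nutch code base.
public class SitemapUrlDedup {

  /**
   * Collapses duplicate sitemap URLs. A LinkedHashSet is used so that the
   * first-seen order of the URLs is preserved.
   */
  static Set<String> dedupe(List<String> sitemapUrls) {
    return new LinkedHashSet<>(sitemapUrls);
  }

  public static void main(String[] args) {
    // Stand-in for the sitemap lines of a robots.txt with repeated entries.
    List<String> fromRobotsTxt = Arrays.asList(
        "https://example.org/sitemap-a.xml",
        "https://example.org/sitemap-b.xml",
        "https://example.org/sitemap-a.xml");

    Set<String> unique = dedupe(fromRobotsTxt);
    System.out.println(fromRobotsTxt.size() + " listed, " + unique.size() + " unique");
  }
}
```

      With a Set, each distinct sitemap is fetched and processed once, regardless of how often robots.txt repeats it.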


          People

            Assignee: Markus Jelsma (markus17)
            Reporter: Markus Jelsma (markus17)
            Votes: 1
            Watchers: 3
