Description
https://archive.epa.gov/robots.txt lists about 160k sitemap URLs, and almost all of them are duplicates; there are no friendly words for that.
Although Nutch chews through this list locally in 22 seconds, the big job on Hadoop fails for some unclear reason, though that job is also handling a lot more data.
Whether or not the duplicates are the actual cause of the failure, treating the sitemap URLs as a Set rather than a List makes sense; a sketch of the idea follows.
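A minimal sketch of the deduplication, assuming the sitemap URLs come back from the robots.txt parser as a java.util.List (as crawler-commons' BaseRobotRules#getSitemaps() does); the class and method names below are illustrative, not Nutch's actual code:

{code:java}
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SitemapDedupSketch {

    /**
     * Collapses duplicate sitemap URLs while preserving first-seen order.
     * With ~160k mostly duplicate entries this shrinks the work list to
     * just the unique URLs before they are queued for fetching.
     */
    public static Set<String> dedupSitemaps(List<String> sitemapUrls) {
        return new LinkedHashSet<>(sitemapUrls);
    }

    public static void main(String[] args) {
        // Hypothetical sample input mimicking the duplicated robots.txt entries.
        List<String> fromRobots = List.of(
                "https://archive.epa.gov/sitemap1.xml",
                "https://archive.epa.gov/sitemap1.xml",   // duplicate
                "https://archive.epa.gov/sitemap2.xml");
        System.out.println(dedupSitemaps(fromRobots));
        // -> [https://archive.epa.gov/sitemap1.xml, https://archive.epa.gov/sitemap2.xml]
    }
}
{code}

Using a LinkedHashSet rather than a plain HashSet keeps the order in which the sitemaps were listed, so the change only drops repeats without reshuffling the crawl order.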
Attachments
Issue Links
- depends upon
  - NUTCH-2796 Upgrade to crawler-commons 1.1 (Closed)
- links to