Description
Citing Sebastian at NUTCH-2420:
The correct solution would be to use <host,score> pairs as keys in the Selector job, with a partitioner and secondary sorting so that all keys with the same host end up in the same call of the reducer. Values could then also hold a HostDb entry, provided the sort comparator guarantees that the HostDb entry (or entries, if partitioned by domain or IP) comes before all CrawlDb entries. But that would be a substantial improvement...
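To make the idea concrete, here is a minimal sketch in plain Java, without Hadoop dependencies, of how a <host,score> composite key, a host-only partitioner and a sort order that puts the HostDb entry first could fit together. All names (`HostSecondarySortSketch`, `Key`, `partition`, `compare`, `reduce`) are hypothetical and not taken from the Nutch code base. The sketch also assumes that the HostDb entry carries a per-host limit as its payload; that is only an illustration of why seeing the HostDb entry first is useful. In a real job these pieces would presumably map to a Hadoop partitioner, a sort comparator and a grouping comparator.

```java
import java.util.ArrayList;
import java.util.List;

public class HostSecondarySortSketch {

    /** Hypothetical composite key; a real job would use a WritableComparable. */
    static final class Key {
        final String host;
        final float score;
        final boolean hostDbEntry; // true for the HostDb record of a host
        final String payload;      // URL for CrawlDb entries; assumed per-host limit for HostDb

        Key(String host, float score, boolean hostDbEntry, String payload) {
            this.host = host;
            this.score = score;
            this.hostDbEntry = hostDbEntry;
            this.payload = payload;
        }
    }

    /** Partition on the host only, so all keys of one host reach the same reducer. */
    static int partition(Key key, int numPartitions) {
        return (key.host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Sort order: by host, then HostDb entry first, then by descending score. */
    static int compare(Key a, Key b) {
        int c = a.host.compareTo(b.host);
        if (c != 0) {
            return c;
        }
        c = Boolean.compare(b.hostDbEntry, a.hostDbEntry);
        if (c != 0) {
            return c;
        }
        return Float.compare(b.score, a.score);
    }

    /** Simulates one reducer: groups by host and applies the limit seen in the HostDb entry. */
    static List<String> reduce(List<Key> keys, int defaultLimit) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(HostSecondarySortSketch::compare);

        List<String> selected = new ArrayList<>();
        String currentHost = null;
        int limit = defaultLimit;
        int count = 0;
        for (Key k : sorted) {
            if (!k.host.equals(currentHost)) {
                // A new host group starts: reset the per-host state.
                currentHost = k.host;
                limit = defaultLimit;
                count = 0;
            }
            if (k.hostDbEntry) {
                // Guaranteed by the sort order to precede the host's CrawlDb entries.
                limit = Integer.parseInt(k.payload);
                continue;
            }
            if (count < limit) {
                selected.add(k.payload);
                count++;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Key> keys = new ArrayList<>();
        keys.add(new Key("a.example", 0.2f, false, "http://a.example/low"));
        keys.add(new Key("b.example", 0.9f, false, "http://b.example/1"));
        keys.add(new Key("a.example", 0.8f, false, "http://a.example/high"));
        keys.add(new Key("a.example", 0.0f, true, "1"));
        keys.add(new Key("a.example", 0.5f, false, "http://a.example/mid"));
        System.out.println(reduce(keys, 2));
    }
}
```

Because the HostDb entry sorts ahead of every CrawlDb entry of its host, the reducer knows the host-specific limit before it sees the first URL and never has to buffer the group. Hosts without a HostDb entry fall back to the default limit.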
Attachments
Issue Links
- is related to: NUTCH-2924 Generate maxCount expr evaluated only once (Closed)
- is required by: NUTCH-2368 Variable generate.max.count and fetcher.server.delay (Closed)
- links to