Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Fix Version/s: 1.15
- Component/s: None
Description
Fetcher may launch more fetcher tasks than there are fetch lists:
18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
This breaks a design principle of Nutch as a MapReduce-based crawler: to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, the Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because multiple fetcher tasks processing the splits of one fetch list could then send requests to the same host/domain/IP in parallel, violating the politeness constraints. See ab's chapter about Nutch in "Hadoop: The Definitive Guide" (3rd edition).
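Since the regression is linked to the migration from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce (NUTCH-2375 below), the fix has to stop the new-API FileInputFormat from splitting the per-partition fetch lists written by the Generator as SequenceFiles. A minimal sketch of how a non-splittable input format can look with the new API follows; the class name is illustrative and not necessarily what the committed patch uses:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

/**
 * Keeps each fetch list part written by the Generator in a single split,
 * so all URLs of one host/domain/IP are fetched by exactly one fetcher
 * task and the politeness delay cannot be violated.
 */
public class NonSplittableSequenceFileInputFormat<K, V>
    extends SequenceFileInputFormat<K, V> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Never split a fetch list across multiple fetcher tasks.
    return false;
  }
}

A Fetcher job would then register it via job.setInputFormatClass(NonSplittableSequenceFileInputFormat.class). The old mapred-based Fetcher already guaranteed one task per fetch list, which is why the extra splits only appeared after the NUTCH-2375 upgrade.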
Issue Links
- is caused by: NUTCH-2375 Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce (Closed)
- links to