  1. Nutch
  2. NUTCH-2652

Fetcher launches more fetch tasks than fetch lists


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: fetcher
    • Labels:
      None

      Description

      Fetcher may launch more fetcher tasks than there are fetch lists:

      18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
      18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
      

      The log shows 187 splits for 128 fetch lists, which conflicts with a design principle of Nutch as a MapReduce-based crawler: to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, the Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because that would violate the politeness constraints: multiple fetcher tasks processing the splits of one fetch list could then send requests to the same host/domain/IP in parallel. See Andrzej Bialecki's chapter about Nutch in "Hadoop: The Definitive Guide" (3rd edition).
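
      The constraint can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not Nutch's Generator or Fetcher code: the class `FetchListSketch` and its methods are hypothetical names, URLs are grouped by host only (not by domain or IP), and a "split" is modelled as cutting a list in half. It shows that host-partitioned lists keep each host in a single task, while splitting those lists spreads a host over several tasks.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of fetch lists; not part of the Nutch code base.
class FetchListSketch {

    // Assign every URL to a fetch list chosen by its host,
    // so that one host never spans two lists.
    static List<List<String>> partitionByHost(List<String> urls, int numLists) {
        List<List<String>> lists = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            lists.add(new ArrayList<>());
        }
        for (String url : urls) {
            String host = URI.create(url).getHost();
            lists.get(Math.floorMod(host.hashCode(), numLists)).add(url);
        }
        return lists;
    }

    // Cut every list into two halves, as a size-based input split might (assumption).
    static List<List<String>> splitEachInHalf(List<List<String>> lists) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> list : lists) {
            int mid = list.size() / 2;
            out.add(new ArrayList<>(list.subList(0, mid)));
            out.add(new ArrayList<>(list.subList(mid, list.size())));
        }
        return out;
    }

    // Count hosts that appear in the input of more than one task.
    static int hostsSharedAcrossTasks(List<List<String>> taskInputs) {
        Set<String> seen = new HashSet<>();
        Set<String> shared = new HashSet<>();
        for (List<String> input : taskInputs) {
            Set<String> hostsHere = new HashSet<>();
            for (String url : input) {
                hostsHere.add(URI.create(url).getHost());
            }
            for (String host : hostsHere) {
                if (!seen.add(host)) {
                    shared.add(host); // host already handled by an earlier task
                }
            }
        }
        return shared.size();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://a.example/2",
            "http://a.example/3", "http://a.example/4",
            "http://b.example/1", "http://b.example/2");

        List<List<String>> fetchLists = partitionByHost(urls, 2);
        System.out.println("shared hosts, one task per fetch list: "
            + hostsSharedAcrossTasks(fetchLists));
        System.out.println("shared hosts, fetch lists split in half: "
            + hostsSharedAcrossTasks(splitEachInHalf(fetchLists)));
    }
}
```

      With one task per fetch list no host is shared between tasks. Once the lists are split, at least one host is requested by two tasks at the same time. In Hadoop's MapReduce API an input format can declare its files non-splittable by overriding `FileInputFormat.isSplitable(JobContext, Path)` to return `false`; whether this is how the issue was resolved is not stated here.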

              People

              • Assignee: Sebastian Nagel (snagel)
              • Reporter: Sebastian Nagel (snagel)