Nutch / NUTCH-2177

Generator produces only one partition even in distributed mode

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.11
    • Component/s: generator
    • Labels:
      None

      Description

      See https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542

      'mapred.job.tracker' is deprecated and has been replaced by 'mapreduce.jobtracker.address'. However, when running Nutch on EMR, 'mapreduce.jobtracker.address' has the value 'local'. As a result the generator believes it is running in local mode and produces a single partition, so a single map task does all the fetching later on (which defeats the object of having a distributed crawler).

      Instead of relying on the jobtracker address, we should probably detect whether we are running on YARN; see http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn
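
      To illustrate the idea, here is a minimal sketch of such a check. It is not the attached patch: the class and method names are hypothetical, and a plain Map stands in for the Hadoop configuration so the example runs without Hadoop on the classpath. It assumes the decision is based on the Hadoop 2 property 'mapreduce.framework.name' (values 'local', 'classic' or 'yarn', defaulting to 'local') rather than on the jobtracker address.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionCheck {
    // Hadoop 2 property naming the execution framework: "local", "classic" or "yarn".
    public static final String FRAMEWORK_KEY = "mapreduce.framework.name";

    /** Hypothetical helper: true only when the job runs in the local runner. */
    public static boolean isLocalMode(Map<String, String> conf) {
        // Hadoop's own default for this property is "local" when it is unset.
        return "local".equals(conf.getOrDefault(FRAMEWORK_KEY, "local"));
    }

    /** Hypothetical helper: force a single partition only in local mode. */
    public static int effectivePartitions(Map<String, String> conf, int numLists) {
        return isLocalMode(conf) ? 1 : numLists;
    }

    public static void main(String[] args) {
        // Assumed EMR-like settings: jobtracker address is "local" but the job runs on YARN.
        Map<String, String> emr = new HashMap<>();
        emr.put("mapreduce.jobtracker.address", "local");
        emr.put(FRAMEWORK_KEY, "yarn");
        System.out.println(effectivePartitions(emr, 4)); // prints 4

        // Genuine local run: property unset, so the default applies.
        Map<String, String> local = new HashMap<>();
        System.out.println(effectivePartitions(local, 4)); // prints 1
    }
}
```

      With a check of this kind, the 'local' value of the jobtracker address no longer affects the number of partitions on a YARN cluster, while a genuine local run still gets exactly one.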

        Attachments

        1. NUTCH-2177.patch
          0.8 kB
          Julien Nioche

            People

            • Assignee: Unassigned
            • Reporter: Julien Nioche (jnioche)
            • Votes: 0
            • Watchers: 4