NUTCH-2148: Review and update mapred --> mapreduce config params in crawl script

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10, 2.3.1
    • Fix Version/s: 1.11
    • Component/s: bin
    • Labels:
      None

      Description

      $NUTCH_HOME/src/bin/crawl still uses the old mapred.* configuration parameter names. It currently includes

      commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
      

      as well as

        skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
        __bin_nutch parse $commonOptions $skipRecordsOptions "$CRAWL_PATH"/segments/$SEGMENT
      

      In all honesty, this should have been addressed as part of the upgrade to Hadoop 2.4.0. Whoops.
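
For reference, here is a minimal sketch of how the two option strings above might look with the newer names. It is not the content of the attached patches. The mapping follows Hadoop's published table of deprecated properties. The value of numTasks is a placeholder chosen for illustration. mapred.child.java.opts is left unchanged on the assumption that it has no single one-to-one replacement (Hadoop 2 offers separate mapreduce.map.java.opts and mapreduce.reduce.java.opts settings instead).

```shell
# Placeholder for illustration only; the crawl script sets this itself.
numTasks=2

# mapred.reduce.tasks                        -> mapreduce.job.reduces
# mapred.reduce.tasks.speculative.execution  -> mapreduce.reduce.speculative
# mapred.map.tasks.speculative.execution     -> mapreduce.map.speculative
# mapred.compress.map.output                 -> mapreduce.map.output.compress
# mapred.child.java.opts is kept as-is (assumption: no single direct replacement).
commonOptions="-D mapreduce.job.reduces=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"

# mapred.skip.attempts.to.start.skipping -> mapreduce.task.skip.start.attempts
# mapred.skip.map.max.skip.records       -> mapreduce.map.skip.maxrecords
skipRecordsOptions="-D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1"

# Print the resulting option strings so they can be checked before use.
echo "$commonOptions"
echo "$skipRecordsOptions"
```

The parse invocation itself would stay the same, since it only expands $commonOptions and $skipRecordsOptions.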

        Attachments

        1. NUTCH-2148v2.patch (2 kB, Lewis John McGibbney)
        2. NUTCH-2148.patch (1 kB, Lewis John McGibbney)

            People

            • Assignee: Lewis John McGibbney (lewismc)
            • Reporter: Lewis John McGibbney (lewismc)