  1. Nutch
  2. NUTCH-2148

Review and update mapred --> mapreduce config params in crawl script

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10, 2.3.1
    • Fix Version/s: 1.11
    • Component/s: bin
    • Labels: None

    Description

      The crawl script at $NUTCH_HOME/src/bin/crawl still passes old-style mapred.* configuration parameters. It currently includes

      commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
      

      as well as

        skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"
        __bin_nutch parse $commonOptions $skipRecordsOptions "$CRAWL_PATH"/segments/$SEGMENT
      

      In all honesty, this should have been addressed as part of the upgrade to Hadoop 2.4.0. Whoops.
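
For reference, here is a rough sketch of how the two option strings might look with mapreduce.* names. The replacement names follow Hadoop's deprecated-properties table and are not copied from the attached patches, so treat them as an assumption to check against the Hadoop version in use. The numTasks default is hypothetical and only there so the snippet runs on its own. mapred.child.java.opts is left unchanged in this sketch because Hadoop 2 has no single one-to-one replacement for it (the per-task settings are mapreduce.map.java.opts and mapreduce.reduce.java.opts).

```shell
# Sketch only: mapreduce.* names taken from Hadoop's deprecated-properties table,
# not from NUTCH-2148.patch or NUTCH-2148v2.patch.
numTasks=${numTasks:-2}   # hypothetical default, for illustration only

# mapred.reduce.tasks                        -> mapreduce.job.reduces
# mapred.reduce.tasks.speculative.execution  -> mapreduce.reduce.speculative
# mapred.map.tasks.speculative.execution     -> mapreduce.map.speculative
# mapred.compress.map.output                 -> mapreduce.map.output.compress
# mapred.child.java.opts is kept as-is (assumption, see the note above)
commonOptions="-D mapreduce.job.reduces=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true"

# mapred.skip.attempts.to.start.skipping     -> mapreduce.task.skip.start.attempts
# mapred.skip.map.max.skip.records           -> mapreduce.map.skip.maxrecords
skipRecordsOptions="-D mapreduce.task.skip.start.attempts=2 -D mapreduce.map.skip.maxrecords=1"

# Print the resulting option strings so they can be compared with the script
echo "$commonOptions"
echo "$skipRecordsOptions"
```

The parse invocation itself would stay the same, since it only expands $commonOptions and $skipRecordsOptions.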

      Attachments

        1. NUTCH-2148.patch
          1 kB
          Lewis John McGibbney
        2. NUTCH-2148v2.patch
          2 kB
          Lewis John McGibbney

          People

            Assignee: lewismc Lewis John McGibbney
            Reporter: lewismc Lewis John McGibbney
            Votes: 0
            Watchers: 2
