NUTCH-171

Bring back multiple segment support for Generate / Update


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 0.8
    • Fix Version/s: 1.0.0
    • Component/s: None
    • Labels: None

    Description

      We find it convenient to run generate once with -topN 300M and get multiple independent segments to work with (lower overhead), then run update in a single pass over all the segments that succeeded.

      The attached patch reactivates -numFetchers and fixes updatedb so that it handles multiple provided segments again.

      Radu Mateescu wrote the attached patch for us with the below description (lightly edited):

      The implementation of -numFetchers in 0.8 misuses the number of reduce tasks to produce a given number of fetch lists. Generate applies map-reduce twice; before the second reduce, it sets the number of reduce tasks to numFetchers. Because each reduce task creates its own file (part-00000, part-00001, etc.) in NDFS, this ideally yields the desired number of fetch lists. This behaviour is incorrect for the following reasons:
      1. The number of reduce tasks is orthogonal to the number of segments somebody wants to create. The number of reduce tasks should be chosen based on the physical topology rather than on the number of segments someone might want in NDFS.
      2. If nutch-site.xml specifies a value for the mapred.reduce.tasks property, numFetchers seems to be ignored.

      Therefore, I changed this behaviour to work as follows:

      • generate creates numFetchers segments
      • each reduce task writes to all segments (assuming there are enough values to be written) in a round-robin fashion

      The end result for 3 reduce tasks and 2 segments looks like this:

      /opt/nutch/bin>./nutch ndfs -ls segments
      060111 122227 parsing file:/opt/nutch/conf/nutch-default.xml
      060111 122228 parsing file:/opt/nutch/conf/nutch-site.xml
      060111 122228 Client connection to 192.168.0.1:5466: starting
      060111 122228 No FS indicated, using default:master:5466
      Found 2 items
      /user/root/segments/20060111122144-0 <dir>
      /user/root/segments/20060111122144-1 <dir>

      /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-0/crawl_generate
      060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
      060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
      060111 122318 No FS indicated, using default:master:5466
      060111 122318 Client connection to 192.168.0.1:5466: starting
      Found 3 items
      /user/root/segments/20060111122144-0/crawl_generate/part-00000 1276
      /user/root/segments/20060111122144-0/crawl_generate/part-00001 1289
      /user/root/segments/20060111122144-0/crawl_generate/part-00002 1858

      /opt/nutch/bin>./nutch ndfs -ls segments/20060111122144-1/crawl_generate
      060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
      060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
      060111 122334 Client connection to 192.168.0.1:5466: starting
      060111 122334 No FS indicated, using default:master:5466
      Found 3 items
      /user/root/segments/20060111122144-1/crawl_generate/part-00000 1207
      /user/root/segments/20060111122144-1/crawl_generate/part-00001 1236
      /user/root/segments/20060111122144-1/crawl_generate/part-00002 1841
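
The round-robin distribution described above can be illustrated with a minimal Java sketch. It is not the code from multi_segment.patch and does not use the Hadoop or Nutch APIs. The class name RoundRobinSegments and the helpers segmentName and distribute are hypothetical, and sending the i-th value seen by a reduce task to segment i mod numSegments is only one simple reading of "round-robin".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates how each reduce task could spread its
// values over all segments. All names here are assumptions.
public class RoundRobinSegments {

    // Hypothetical helper: segment directory name, e.g. 20060111122144-0.
    static String segmentName(String timestamp, int index) {
        return timestamp + "-" + index;
    }

    // Distributes the values seen by one reduce task across numSegments
    // segments. Returns a map from output path to the values written there.
    static Map<String, List<String>> distribute(String timestamp, int reduceTask,
                                                int numSegments, List<String> urls) {
        Map<String, List<String>> out = new TreeMap<>();
        for (int i = 0; i < urls.size(); i++) {
            int segment = i % numSegments; // round-robin choice of segment
            String path = String.format("%s/crawl_generate/part-%05d",
                    segmentName(timestamp, segment), reduceTask);
            out.computeIfAbsent(path, k -> new ArrayList<>()).add(urls.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        int numSegments = 2;    // plays the role of -numFetchers
        int numReduceTasks = 3; // chosen independently of numSegments
        List<String> urls = Arrays.asList("u1", "u2", "u3", "u4", "u5");
        for (int task = 0; task < numReduceTasks; task++) {
            Map<String, List<String>> written =
                    distribute("20060111122144", task, numSegments, urls);
            for (Map.Entry<String, List<String>> e : written.entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue().size());
            }
        }
    }
}
```

With this scheme every segment ends up with one part file per reduce task, which matches the shape of the listing above (3 part files in each of the 2 segments), while the number of reduce tasks stays independent of the number of segments.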

      Attachments

        1. multi_segment.patch
          9 kB
          Rod Taylor

          People

            Assignee: Andrzej Bialecki (ab)
            Reporter: Rod Taylor (rbt)