Hadoop Common / HADOOP-200

The map task names are sent to the reduces

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.2.0
    • Fix Version/s: 0.3.0
    • Component/s: None
    • Labels: None

    Description

      As each reduce is created, it is given the entire set of potential map names. For my large sort jobs with 64k maps, this means that each reduce task is given a two-dimensional array of 5 tasks/map * 64k maps = 320k strings. Because the reduce task is passed from the job tracker to the task tracker and then down to the task runner, passing the entire list is very expensive. I suspect that this is the cause of the slowdowns that I see in the task trackers' heartbeats when the reduce tasks are being launched.
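To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch. The class name `MapNamePayload` and the average task-name length of 30 characters are assumptions chosen for illustration only; they are not taken from Hadoop or from this issue.

```java
// Rough estimate of how much map-name data each reduce task carries.
// Assumption: 64k means 64 * 1024 maps; the 30-character name length is hypothetical.
final class MapNamePayload {

    /** Number of task-name strings handed to a single reduce task. */
    static long stringCount(int maps, int tasksPerMap) {
        return (long) maps * tasksPerMap;
    }

    /** Approximate character count, given an assumed average name length. */
    static long approxChars(long strings, int avgNameLength) {
        return strings * avgNameLength;
    }

    public static void main(String[] args) {
        int maps = 64 * 1024;      // 64k maps, as in the large sort job
        int tasksPerMap = 5;       // 5 tasks/map, as in the description
        int avgNameLength = 30;    // hypothetical average name length

        long strings = stringCount(maps, tasksPerMap);
        System.out.println("strings per reduce task: " + strings);
        System.out.println("approx. characters per reduce task: "
                + approxChars(strings, avgNameLength));
    }
}
```

Whatever the real name length is, this payload is repeated for every reduce task and on every hop (job tracker, task tracker, task runner), which is why the cost adds up.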

      I propose that the ReduceTask be changed to take just the count of maps, with ids from 0 to maps - 1:
      public ReduceTask(String jobFile, String taskId, int maps, int partition);
      The protocol for finding map outputs would then need to change to:
      MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition);
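A minimal sketch of how the two proposed signatures could fit together, under stated assumptions. Only the `ReduceTask` constructor and the `locateMapOutputs` signature come from the proposal. The wrapper class `MapIdSketch`, the fields of `MapOutputLocation`, the `MapOutputLookup` interface, the `InMemoryLookup` class, and the choice to omit maps that have not finished are all hypothetical and do not reflect the attached patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapIdSketch {

    /** Hypothetical stand-in for MapOutputLocation: where one map's output lives. */
    static final class MapOutputLocation {
        final int mapId;
        final String host;

        MapOutputLocation(int mapId, String host) {
            this.mapId = mapId;
            this.host = host;
        }
    }

    /** Reduce task that carries only the map count instead of every map name. */
    static final class ReduceTask {
        final String jobFile;
        final String taskId;
        final int maps;
        final int partition;

        ReduceTask(String jobFile, String taskId, int maps, int partition) {
            this.jobFile = jobFile;
            this.taskId = taskId;
            this.maps = maps;
            this.partition = partition;
        }

        /** Map ids are implicit: 0 .. maps - 1. */
        int[] allMapIds() {
            int[] ids = new int[maps];
            for (int i = 0; i < maps; i++) {
                ids[i] = i;
            }
            return ids;
        }
    }

    /** Lookup keyed by integer map id, following the proposed signature. */
    interface MapOutputLookup {
        MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition);
    }

    /** Toy in-memory lookup; assumes unfinished maps are simply left out of the result. */
    static final class InMemoryLookup implements MapOutputLookup {
        private final Map<Integer, String> finished = new HashMap<>();

        void mapFinished(int mapId, String host) {
            finished.put(mapId, host);
        }

        @Override
        public MapOutputLocation[] locateMapOutputs(String jobId, int[] mapIds, int partition) {
            List<MapOutputLocation> found = new ArrayList<>();
            for (int id : mapIds) {
                String host = finished.get(id);
                if (host != null) {
                    found.add(new MapOutputLocation(id, host));
                }
            }
            return found.toArray(new MapOutputLocation[0]);
        }
    }

    public static void main(String[] args) {
        ReduceTask reduce = new ReduceTask("job.xml", "task_r_0", 4, 0);
        InMemoryLookup lookup = new InMemoryLookup();
        lookup.mapFinished(1, "host-a");
        lookup.mapFinished(3, "host-b");

        MapOutputLocation[] locations =
                lookup.locateMapOutputs("job_0001", reduce.allMapIds(), reduce.partition);
        for (MapOutputLocation loc : locations) {
            System.out.println("map " + loc.mapId + " -> " + loc.host);
        }
    }
}
```

The point of the design is that the reduce task itself shrinks to a few scalar fields, and map identity travels as small integers only when outputs are actually being located.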

      Attachments

        1. map-id.patch
          21 kB
          Owen O'Malley

          People

            Assignee: Owen O'Malley (omalley)
            Reporter: Owen O'Malley (omalley)