final map output not evenly distributed across multiple disks

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.16.0
    • Fix Version/s: 0.15.2
    • Component/s: None
    • Labels: None

      Description

      It seems that the final merge output of map tasks for a particular job is not placed on a randomly selected output location.

      As a result, a job with a lot of map tasks eventually runs out of TaskTrackers that ask for more tasks, because the disk holding most of the map outputs ends up with less free space than the threshold specified by mapred.local.dir.minspacestart.

      Maybe the starting point of the round-robin selection among multiple locations should be randomized.
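
      The randomized start suggested above could be sketched roughly as follows. This is only an illustration of the idea, not the code from the attached patches or Hadoop's actual allocator: the class name RoundRobinDirSelector and its methods are hypothetical, and the sketch ignores free-space checks entirely.

```java
import java.util.Random;

// Hypothetical helper illustrating a round-robin selector whose
// starting index is randomized. Not Hadoop's real implementation.
final class RoundRobinDirSelector {
    private final int numDirs; // number of configured local directories
    private int next;          // index that will be handed out next

    RoundRobinDirSelector(int numDirs, Random rng) {
        if (numDirs <= 0) {
            throw new IllegalArgumentException("numDirs must be positive");
        }
        this.numDirs = numDirs;
        // Randomize the starting point so that independent selectors
        // (for example, one per task) do not all begin at index 0.
        this.next = rng.nextInt(numDirs);
    }

    // Returns the index of the directory to use, then advances
    // to the following directory, wrapping around at the end.
    synchronized int next() {
        int chosen = next;
        next = (next + 1) % numDirs;
        return chosen;
    }
}
```

      With 4 locations, a selector created as new RoundRobinDirSelector(4, new Random()) begins at an arbitrary index and then cycles through all 4 in order, so many short-lived selectors should spread their first choice across the disks rather than favoring one.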

      In our case:
      110,000 maps, each with about 3GB of final output, on a 1300-node cluster.
      With 4 locations, after about 79,000 maps had been processed, the final map outputs ('file.out') were distributed as follows:
      location1: 24,000
      location2: 25
      location3: 55,000
      location4: 7

        Attachments

        1. HADOOP-2437_2_20071220.patch
          4 kB
          Arun C Murthy
        2. HADOOP-2437_1_20071218.patch
          1 kB
          Arun C Murthy
        3. HADOOP-2437_1_20071218.patch
          0.9 kB
          Arun C Murthy

              People

              • Assignee: Arun C Murthy (acmurthy)
              • Reporter: Christian Kunz (ckunz)