Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-1043

Optimize the shuffle phase (increase the parallelism)

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Major
    • Resolution: Fixed
    • None
    • 0.12.0
    • None
    • None

    Description

      In the current shuffle code, only one map output location node is accessed from any Reduce at any given point of time. For example, if a particular node, say machine1.foo.com ran 300 maps, the reducer would fetch just one output from there at a time. machine1.foo.com will be inserted into a Set datastructure (uniqueHosts) and until it gets removed from there, no other map output will be fetched from that machine. The fact that only one map output is fetched at a time from any particular host seems fine, but the logic for removing a node from uniqueHosts is such that there could be a lot of delay before a node gets deleted from the Set datastructure (even after the map output has been fetched from that node). This probably leads to suboptimal performance since it reduces the parallelism in fetching.

      Attachments

        1. 1043.patch
          0.7 kB
          Devaraj Das

        Activity

          People

            ddas Devaraj Das
            ddas Devaraj Das
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: