Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-1043

Optimize the shuffle phase (increase the parallelism)

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
      None

      Description

      In the current shuffle code, only one map output location node is accessed from any Reduce at any given point of time. For example, if a particular node, say machine1.foo.com ran 300 maps, the reducer would fetch just one output from there at a time. machine1.foo.com will be inserted into a Set datastructure (uniqueHosts) and until it gets removed from there, no other map output will be fetched from that machine. The fact that only one map output is fetched at a time from any particular host seems fine, but the logic for removing a node from uniqueHosts is such that there could be a lot of delay before a node gets deleted from the Set datastructure (even after the map output has been fetched from that node). This probably leads to suboptimal performance since it reduces the parallelism in fetching.

        Attachments

        1. 1043.patch
          0.7 kB
          Devaraj Das

          Activity

            People

            • Assignee:
              ddas Devaraj Das
              Reporter:
              ddas Devaraj Das
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: