Hadoop Map/Reduce
MAPREDUCE-292

Pin reduces with consecutive IDs to nodes and have a single shuffle task per job per node


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      The idea is to reduce disk seeks while fetching the map outputs. If we opportunistically pin reduces with consecutive IDs (like 5, 6, 7, up to max-reduce-tasks on that node) to a node and run a single shuffle task there, we should benefit: on every fetch, that shuffle task fetches the outputs for all the reduces it is shuffling for in one go. In the case where we have 2 reduces per node, we decrease the number of seeks in the map output files on the map nodes by 50%. Memory usage by that shuffle task would be proportional to the number of reduces it is shuffling for (to account for the number of ramfs instances, one per reduce). But overall it should help.

      Thoughts?
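      The 50% figure can be sketched with a toy seek-count model (a hypothetical illustration; `ShuffleSeekModel` and its methods are not Hadoop classes). With R reduces pinned to a node, per-reduce fetches cost R seeks in each map output file, while one batched fetch of a consecutive run of partitions costs a single seek, since consecutive partitions are contiguous in the file.

```java
// Hypothetical model of the proposal; names are illustrative, not Hadoop APIs.
public class ShuffleSeekModel {

    // Seeks per map-output file when each reduce fetches its partition separately:
    // one seek per partition.
    static int seeksPerFile(int reducesPerNode) {
        return reducesPerNode;
    }

    // Seeks per map-output file when a single shuffle task fetches a consecutive
    // run of partitions in one request: the partitions are contiguous on disk,
    // so one seek suffices.
    static int seeksPerFileBatched(int reducesPerNode) {
        return 1;
    }

    public static void main(String[] args) {
        int r = 2; // reduces pinned per node
        double reduction = 1.0 - (double) seeksPerFileBatched(r) / seeksPerFile(r);
        System.out.println("Seek reduction with " + r + " reduces/node: "
                + (int) (reduction * 100) + "%");
    }
}
```

      Under this model the savings grow with the number of reduces pinned per node: the fraction of seeks eliminated is 1 - 1/R, i.e. 50% at R = 2 and 75% at R = 4.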


          People

            Assignee: ddas Devaraj Das
            Reporter: ddas Devaraj Das
            Votes: 0
            Watchers: 3
