Hadoop Common / HADOOP-910

Reduces can do merges for the on-disk map output files in parallel with their copying

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Reducers now perform merges of shuffle data (both in-memory and on disk) while fetching map outputs. Earlier, during shuffle they used to merge only the in-memory outputs.

      Description

      Proposal to extend the parallel in-memory merge and copy, which is being done as part of HADOOP-830, to the on-disk files.

      Today, the Reduces dump the map output files to disk, and the final merge happens only after all the map outputs have been collected. It might make sense to parallelize this part: whenever a Reduce has collected io.sort.factor segments on disk, it initiates a merge of those and creates one big segment. If the rate of copying is faster than the merge, we can probably have multiple threads doing parallel merges of independent sets of io.sort.factor segments. If the rate of copying is not as fast as the merge, we stand to gain a lot: at the end of copying all the map outputs, we will be left with a small number of segments for the final merge, which hopefully will feed the reduce directly (via the RawKeyValueIterator) without having to hit the disk to write additional output segments.
      If the disk bandwidth is higher than the network bandwidth, I guess we have a good case for doing this.
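
The trigger described above can be illustrated with a minimal sketch. It is not the code from the attached patches: the class and method names (OnDiskMergeSketch, segmentCopied, finalMerge) are hypothetical, sorted in-memory lists of integers stand in for on-disk segments, and the merge runs inline rather than on a separate thread. It only shows the idea of merging as soon as io.sort.factor segments have accumulated, so that few segments remain for the final merge.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
final class OnDiskMergeSketch {
    private final int ioSortFactor;  // stand-in for the io.sort.factor setting
    private final List<List<Integer>> segments = new ArrayList<>();
    private int mergesDone = 0;

    OnDiskMergeSketch(int ioSortFactor) {
        if (ioSortFactor < 2) {
            throw new IllegalArgumentException("factor must be >= 2");
        }
        this.ioSortFactor = ioSortFactor;
    }

    /** Called when one map output has been copied; the segment must already be sorted. */
    void segmentCopied(List<Integer> sortedSegment) {
        segments.add(sortedSegment);
        // As soon as io.sort.factor segments are pending, merge them into one big segment.
        if (segments.size() >= ioSortFactor) {
            List<Integer> merged = mergeSorted(segments);
            segments.clear();
            segments.add(merged);
            mergesDone++;
        }
    }

    int segmentCount() { return segments.size(); }

    int mergesDone() { return mergesDone; }

    /** Final merge over whatever segments remain once copying has finished. */
    List<Integer> finalMerge() { return mergeSorted(segments); }

    /** K-way merge of sorted lists using a min-heap keyed on the current head values. */
    static List<Integer> mergeSorted(List<List<Integer>> inputs) {
        // Heap entries are {value, segment index, position within segment}.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int i = 0; i < inputs.size(); i++) {
            if (!inputs.get(i).isEmpty()) {
                heap.add(new int[] {inputs.get(i).get(0), i, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            out.add(head[0]);
            List<Integer> src = inputs.get(head[1]);
            int next = head[2] + 1;
            if (next < src.size()) {
                heap.add(new int[] {src.get(next), head[1], next});
            }
        }
        return out;
    }
}
```

With a factor of 3, the third call to segmentCopied triggers a merge and leaves one segment pending; a real implementation would presumably hand the merge to a background thread so that copying can continue while it runs.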

      1. HADOOP-910.patch
        13 kB
        Amar Kamat
      2. HADOOP-910.patch
        14 kB
        Amar Kamat
      3. HADOOP-910.patch
        14 kB
        Amar Kamat
      4. HADOOP-910-review.patch
        13 kB
        Amar Kamat

        Activity

        Devaraj Das created issue
        Gautam Kowshik made changes:
          Assignee: Gautam Kowshik [ gautamk ]
        Devaraj Das made changes:
          Assignee: Gautam Kowshik [ gautamk ] -> Amar Kamat [ amar_kamat ]
        Amar Kamat made changes:
          Attachment: HADOOP-910-review.patch [ 12373598 ]
        Amar Kamat made changes:
          Attachment: HADOOP-910.patch [ 12375831 ]
        Amar Kamat made changes:
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Amar Kamat made changes:
          Attachment: HADOOP-910.patch [ 12376230 ]
        Amar Kamat made changes:
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
        Amar Kamat made changes:
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Amar Kamat made changes:
          Attachment: HADOOP-910.patch [ 12376894 ]
        Amar Kamat made changes:
          Status: Patch Available [ 10002 ] -> Open [ 1 ]
        Amar Kamat made changes:
          Status: Open [ 1 ] -> Patch Available [ 10002 ]
        Devaraj Das made changes:
          Status: Patch Available [ 10002 ] -> Resolved [ 5 ]
          Resolution: Fixed [ 1 ]
          Fix Version/s: 0.17.0 [ 12312913 ]
        Devaraj Das made changes:
          Hadoop Flags: [Reviewed]
          Release Note: Reducers now perform merges of shuffle data (both in-memory and on disk) while fetching map outputs. Earlier, during shuffle they used to merge only the in-memory outputs.
        Nigel Daley made changes:
          Status: Resolved [ 5 ] -> Closed [ 6 ]
        Owen O'Malley made changes:
          Component/s: mapred [ 12310690 ]

          People

          • Assignee:
            Amar Kamat
            Reporter:
            Devaraj Das
          • Votes:
            0
            Watchers:
            0
