  1. Hadoop Map/Reduce
  2. MAPREDUCE-1690

Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.20.2, 0.20.3
    • Fix Version/s: 0.20.2
    • Component/s: task, tasktracker
    • Labels: None

    Description

      When a reduce task launches, it starts several MapOutputCopier threads to download the output of finished map tasks. Each time a MapOutputCopier thread copies a map output, whether from a remote host or the local one, it decides whether to shuffle that output into memory or to disk. The decision depends on the size of the map output and on the ShuffleRamManager configuration, which is loaded from the client's hadoop-site.xml or the JobConf. If the output is to be shuffled in memory, the MapOutputCopier connects to the remote map host, reads the map output from the socket, and copies it into an in-memory buffer. That buffer is newly allocated every time with "byte[] shuffleData = new byte[mapOutputLength];", and this is where the problem begins. In our cluster, some special jobs process a huge amount of input data, say 110TB, so their reduce tasks shuffle a lot of data. Some of it goes to disk, but a large amount is still shuffled in memory, and each in-memory copy allocates a new buffer from the reduce task's heap. For a long-running job over huge data, this easily fills the reduce task's heap, causes the reduce task to run out of memory (OOM), and then exhausts the memory of the TaskTracker machine.
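The memory-or-disk decision and the per-copy allocation described above can be illustrated roughly as follows. This is a simplified stand-in, not the Hadoop source: the class name, method names, and the two fractions passed to the constructor are hypothetical, and the real limits should be checked against ShuffleRamManager in the 0.20 code.

```java
/**
 * Simplified stand-in for the memory-or-disk decision made per map output.
 * All names and fractions here are illustrative assumptions, not Hadoop's code.
 */
final class ShuffleDecisionSketch {
    private final long memoryBudget;      // bytes of heap set aside for in-memory shuffle
    private final long maxSingleShuffle;  // largest single map output kept in memory

    ShuffleDecisionSketch(long heapBytes, double bufferFraction, double singleSegmentFraction) {
        this.memoryBudget = (long) (heapBytes * bufferFraction);
        this.maxSingleShuffle = (long) (memoryBudget * singleSegmentFraction);
    }

    long memoryBudget() {
        return memoryBudget;
    }

    /** True if a map output of this length would be shuffled in memory rather than to disk. */
    boolean fitsInMemory(long mapOutputLength) {
        return mapOutputLength < Integer.MAX_VALUE && mapOutputLength < maxSingleShuffle;
    }

    /** The pattern the issue objects to: a fresh heap allocation for every map output. */
    byte[] shuffleInMemory(int mapOutputLength) {
        return new byte[mapOutputLength];
    }
}
```

With thousands of map outputs per reduce task, each call to shuffleInMemory creates a new array that the garbage collector must later reclaim, which is the heap pressure the issue describes.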
      Here is our solution: change the logic used when MapOutputCopier threads shuffle map output in memory so that it uses a BuddySystem, similar to the buddy system the Linux kernel uses to allocate and deallocate memory pages. When the reduce task launches, it initializes the BuddySystem with a fixed amount of memory, say 128MB. Each time the reduce task wants to shuffle a map output in memory, it requests a buffer from the BuddySystem. If the BuddySystem has enough free memory, the buffer is used; if not, the MapOutputCopier thread calls wait(), just as it does in the current Hadoop shuffle code. This reduces the reduce task's memory usage and greatly relieves the memory shortage on the TaskTracker. In our cluster, the BuddySystem eliminated the situation in which a batch of TaskTrackers was lost to memory overuse while huge jobs were running, which made the cluster more stable.
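A minimal sketch of such a buddy allocator is shown below, assuming a single preallocated byte[] arena whose size and minimum block size are powers of two. The class BuddyBufferPool and all of its methods are hypothetical names chosen for illustration; this is not the patch from this issue, and a real integration would also need to hand out views of the arena to the copier threads.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative buddy allocator over one preallocated arena (hypothetical, not the issue's patch). */
final class BuddyBufferPool {
    private final byte[] arena;   // allocated once, reused for every in-memory shuffle
    private final int minBlock;   // smallest block size; a block of order k is minBlock << k bytes
    private final int maxOrder;   // order of the block that spans the whole arena
    private final List<Deque<Integer>> freeLists = new ArrayList<>(); // free offsets per order

    BuddyBufferPool(int arenaSize, int minBlock) {
        if (minBlock <= 0 || arenaSize < minBlock
                || Integer.bitCount(arenaSize) != 1 || Integer.bitCount(minBlock) != 1) {
            throw new IllegalArgumentException("sizes must be powers of two");
        }
        this.arena = new byte[arenaSize];
        this.minBlock = minBlock;
        this.maxOrder = Integer.numberOfTrailingZeros(arenaSize / minBlock);
        for (int k = 0; k <= maxOrder; k++) {
            freeLists.add(new ArrayDeque<>());
        }
        freeLists.get(maxOrder).push(0); // initially one free block covering the arena
    }

    byte[] arena() {
        return arena;
    }

    /** Smallest order whose block can hold size bytes. */
    private int orderFor(int size) {
        int order = 0;
        while ((minBlock << order) < size) {
            order++;
        }
        return order;
    }

    /** Returns the arena offset of a block of at least size bytes, waiting until one is free. */
    synchronized int allocate(int size) {
        if (size <= 0 || size > arena.length) {
            throw new IllegalArgumentException("bad size " + size);
        }
        int want = orderFor(size);
        int offset;
        while ((offset = tryAllocate(want)) < 0) {
            try {
                wait(); // block the copier thread until free() releases memory
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for shuffle memory", e);
            }
        }
        return offset;
    }

    private int tryAllocate(int want) {
        int order = want;
        while (order <= maxOrder && freeLists.get(order).isEmpty()) {
            order++;
        }
        if (order > maxOrder) {
            return -1; // nothing large enough is free right now
        }
        int offset = freeLists.get(order).pop();
        while (order > want) { // split: keep the lower half, free the upper half
            order--;
            freeLists.get(order).push(offset + (minBlock << order));
        }
        return offset;
    }

    /** Releases a block obtained from allocate(size) and merges it with free buddies. */
    synchronized void free(int offset, int size) {
        int order = orderFor(size);
        while (order < maxOrder) {
            int buddy = offset ^ (minBlock << order); // buddy differs only in this one bit
            if (!freeLists.get(order).remove(Integer.valueOf(buddy))) {
                break; // buddy still in use, stop merging
            }
            offset = Math.min(offset, buddy);
            order++;
        }
        freeLists.get(order).push(offset);
        notifyAll(); // wake copier threads waiting in allocate()
    }

    synchronized int freeBytes() {
        int total = 0;
        for (int k = 0; k <= maxOrder; k++) {
            total += freeLists.get(k).size() * (minBlock << k);
        }
        return total;
    }
}
```

For example, with new BuddyBufferPool(1024, 64), allocate(100) rounds the request up to a 128-byte block, and freeing every block merges the buddies back into one 1024-byte block. The rounding to powers of two trades some internal fragmentation for a fixed memory ceiling and no per-copy heap allocation.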

          People

            Assignee: Unassigned
            Reporter: luoli
            Votes: 0
            Watchers: 16
