Hadoop Map/Reduce / MAPREDUCE-6470

ApplicationMaster may fail to preempt Reduce task

Details

    Description

      In my Hadoop cluster the NodeManagers have different resource capacities.
      Recently, when the YARN cluster ran out of resources while some big jobs were running, the AM could not preempt reduce tasks.

      The scenario could be simplified as below:
      Say there are 5 NodeManagers in my Hadoop cluster, with the FairScheduler enabled.

      NodeManager Capacity :

      namenode1 <1024 memory, 1 cpu-vcores>
      namenode2 <4096 memory, 1 cpu-vcores>
      namenode3 <4096 memory, 1 cpu-vcores>
      namenode4 <1024 memory, 4 cpu-vcores>
      namenode5 <1024 memory, 4 cpu-vcores>

      Start one job with 10 maps and 10 reduces, using the following configuration:
      yarn.app.mapreduce.am.resource.mb=1024m
      yarn.app.mapreduce.am.resource.cpu-vcores=1
      mapreduce.map.memory.mb=1024m
      mapreduce.reduce.memory.mb=1024m
      mapreduce.map.cpu.vcores=1
      mapreduce.reduce.cpu.vcores=1

      After some map tasks finished, 4 reduce tasks started, but there were still some map tasks in scheduledRequests.
      At this time, the resource usage of the 5 NodeManagers is as below.
      NodeManager, Memory Used, Vcores Used, Memory Avail, Vcores Avail
      namenode1, 1024m, 1, 0, 0
      namenode2, 1024m, 1, 3072m, 0
      namenode3, 1024m, 1, 3072m, 0
      namenode4, 1024m, 1, 0, 3
      namenode5, 1024m, 1, 0, 3

      So the AM tries to start the remaining map tasks.

      In RMContainerAllocator, the availableResources obtained from ApplicationMasterService is <6144m, 6 cpu-vcores>.
      RMContainerAllocator then concludes that there are enough resources to start one map task, so it does not try to preempt a reduce task. In fact, no single NodeManager has enough available resources to run one map task: no node has both free memory and a free vcore. In this case the AM fails to obtain a container for the remaining map tasks. Since the reduce tasks are never preempted, their resources are never released, and the job hangs forever.
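
      The mismatch can be reproduced with plain arithmetic on the table above. The following is a minimal sketch, not the RMContainerAllocator code: the class `HeadroomVsPerNode`, the `Res` type, and the method names are hypothetical and exist only for illustration. It sums the per-node availability into one headroom figure and compares that check with a per-node fit check for a <1024m, 1 vcore> map request.

```java
import java.util.Arrays;
import java.util.List;

public class HeadroomVsPerNode {

    /** Hypothetical <memory, vcores> pair; NOT the YARN Resource class. */
    static final class Res {
        final int memoryMb;
        final int vcores;

        Res(int memoryMb, int vcores) {
            this.memoryMb = memoryMb;
            this.vcores = vcores;
        }

        /** True if the request fits inside this amount on both dimensions. */
        boolean canHold(Res request) {
            return memoryMb >= request.memoryMb && vcores >= request.vcores;
        }
    }

    /** Sums per-node availability into a single cluster-wide figure. */
    static Res aggregate(List<Res> nodes) {
        int mem = 0;
        int cores = 0;
        for (Res n : nodes) {
            mem += n.memoryMb;
            cores += n.vcores;
        }
        return new Res(mem, cores);
    }

    /** Per-node check: at least one node must hold the whole request. */
    static boolean fitsOnSomeNode(List<Res> nodes, Res request) {
        for (Res n : nodes) {
            if (n.canHold(request)) {
                return true;
            }
        }
        return false;
    }

    /** Available resources per node, copied from the usage table in the report. */
    static List<Res> reportedAvailability() {
        return Arrays.asList(
            new Res(0, 0),      // namenode1
            new Res(3072, 0),   // namenode2
            new Res(3072, 0),   // namenode3
            new Res(0, 3),      // namenode4
            new Res(0, 3));     // namenode5
    }

    public static void main(String[] args) {
        List<Res> avail = reportedAvailability();
        Res mapRequest = new Res(1024, 1);
        Res headroom = aggregate(avail);

        System.out.println("headroom = <" + headroom.memoryMb + "m, "
            + headroom.vcores + " vcores>");                     // <6144m, 6 vcores>
        System.out.println("fits by headroom:  " + headroom.canHold(mapRequest));       // true
        System.out.println("fits on some node: " + fitsOnSomeNode(avail, mapRequest));  // false
    }
}
```

      The aggregate check says the map task fits, while the per-node check says it does not, which is the situation in which a decision based only on the headroom never triggers reduce preemption.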

      I think the problem is that the overall resource headroom does not give the AM enough information to make the right decision on whether to preempt a reduce task. We need to provide more information to the AM, e.g. add a new API in AllocateResponse that returns the list of available resources on all NodeManagers. But this approach might add too much overhead.

      Any ideas?

            People

              Assignee: Unassigned
              Reporter: NING DING (iceberg565)