Details
- Bug
- Status: Open
- Major
- Resolution: Unresolved
- 2.7.1
- None
- None
Description
In my Hadoop cluster, the NodeManagers have different resource capacities.
Recently, when the YARN cluster ran out of resources while some big jobs were running, the AM could not preempt reduce tasks.
The scenario can be simplified as follows:
Say there are 5 NodeManagers in the cluster, with the FairScheduler enabled.
NodeManager Capacity:
namenode1 <1024 memory, 1 cpu-vcores>
namenode2 <4096 memory, 1 cpu-vcores>
namenode3 <4096 memory, 1 cpu-vcores>
namenode4 <1024 memory, 4 cpu-vcores>
namenode5 <1024 memory, 4 cpu-vcores>
Start one job with 10 maps and 10 reduces, using the following configuration:
yarn.app.mapreduce.am.resource.mb=1024m
yarn.app.mapreduce.am.resource.cpu-vcores=1
mapreduce.map.memory.mb=1024m
mapreduce.reduce.memory.mb=1024m
mapreduce.map.cpu.vcores=1
mapreduce.reduce.cpu.vcores=1
After some map tasks finished, 4 reduce tasks started, but there were still some map tasks in scheduledRequests.
At this point, the resource usage of the 5 NodeManagers is as below.
NodeManager, Memory Used, Vcores Used, Memory Avail, Vcores Avail
namenode1, 1024m, 1, 0, 0
namenode2, 1024m, 1, 3072m, 0
namenode3, 1024m, 1, 3072m, 0
namenode4, 1024m, 1, 0, 3
namenode5, 1024m, 1, 0, 3
So the AM tries to start the remaining map tasks.
In RMContainerAllocator, the availableResources obtained from ApplicationMasterService is <6144m, 6 cpu-vcores>.
RMContainerAllocator then concludes that there are enough resources to start a map task, so it does not try to preempt a reduce task. In fact, no single NodeManager has enough available resources to run one map task: the nodes with free memory have no free vcores, and the nodes with free vcores have no free memory. The AM therefore fails to obtain a container for the remaining map tasks. Since the reduce tasks are never preempted, their resources are never released, and the job hangs forever.
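The mismatch between the aggregate headroom and what actually fits on a node can be checked with a small model of the usage table. This is an illustrative Python sketch, not Hadoop code: `Node`, `aggregate_headroom`, and `fits_on_any_node` are hypothetical names, and the numbers are copied from the table in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    mem_avail_mb: int
    vcores_avail: int


# Available resources per node, copied from the usage table in this report.
NODES = [
    Node("namenode1", 0, 0),
    Node("namenode2", 3072, 0),
    Node("namenode3", 3072, 0),
    Node("namenode4", 0, 3),
    Node("namenode5", 0, 3),
]


def aggregate_headroom(nodes):
    """Sum availability across all nodes, the way a cluster-wide headroom does."""
    return (
        sum(n.mem_avail_mb for n in nodes),
        sum(n.vcores_avail for n in nodes),
    )


def fits_on_any_node(nodes, mem_mb, vcores):
    """A container needs its memory AND its vcores on the same node."""
    return any(
        n.mem_avail_mb >= mem_mb and n.vcores_avail >= vcores for n in nodes
    )


if __name__ == "__main__":
    print(aggregate_headroom(NODES))         # (6144, 6)
    print(fits_on_any_node(NODES, 1024, 1))  # False
```

The aggregate says a <1024m, 1 vcore> map container should fit six times over, while the per-node check shows it fits nowhere.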
I think the problem is that the overall resource headroom alone is not enough for the AM to make the right decision on whether to preempt a reduce task. We need to provide more information to the AM, e.g. by adding a new API in AllocateResponse that returns the available resources on each NodeManager. But this approach might add too much overhead.
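To make the proposal concrete, here is a rough Python sketch of how the preemption decision would differ if the AM received per-node availability. It is a simplified model under stated assumptions, not the actual RMContainerAllocator logic: both function names are hypothetical, and the per-node list stands in for whatever the new API would return.

```python
# Per-node (memory MB, vcores) availability, taken from the usage table in
# this report. A per-node list like this is what the proposed API would expose
# (hypothetical shape).
NODE_AVAIL = [(0, 0), (3072, 0), (3072, 0), (0, 3), (0, 3)]

MAP_REQUEST = (1024, 1)  # <memory MB, vcores> for one map task


def should_preempt_aggregate_only(node_avail, request):
    """Model of the current behavior: decide from the summed headroom only."""
    total_mem = sum(mem for mem, _ in node_avail)
    total_vcores = sum(vc for _, vc in node_avail)
    enough = total_mem >= request[0] and total_vcores >= request[1]
    return not enough


def should_preempt_with_node_list(node_avail, request):
    """Model of the proposal: preempt unless some single node fits the request."""
    fits = any(mem >= request[0] and vc >= request[1] for mem, vc in node_avail)
    return not fits


if __name__ == "__main__":
    print(should_preempt_aggregate_only(NODE_AVAIL, MAP_REQUEST))   # False
    print(should_preempt_with_node_list(NODE_AVAIL, MAP_REQUEST))   # True
```

With the aggregate view the AM sees no reason to preempt, whereas the per-node view shows that a reduce task has to be preempted before any map can run. The overhead concern still applies, since the list grows with the number of NodeManagers.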
Any ideas?
Issue Links
- is related to: YARN-4125 In addition to aggregate availability for an app, headroom should provide information on largest container that can be allocated (Open)