Description
The following exception in AM of an application at the top of queue causes this. Once this happens, AM keeps obtaining
containers from RM and simply loses them. Eventually on a cluster with multiple jobs, no more scheduling happens
because of these lost containers.
It happens when there are blacklisted nodes at the app level in AM. A bug in AM
(RMContainerRequestor.containerFailedOnHost(hostName)) is causing this - nodes are simply getting removed from the
request-table. We should make sure RM also knows about this update.
========================================================================
11/06/17 06:11:18 INFO rm.RMContainerAllocator: Assigned based on host match 98.138.163.34
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20
resourceName=... numContainers=4978 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20
resourceName=... numContainers=4977 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: BEFORE decResourceRequest: applicationId=30 priority=20
resourceName=... numContainers=1540 #asks=5
11/06/17 06:11:18 INFO rm.RMContainerRequestor: AFTER decResourceRequest: applicationId=30 priority=20
resourceName=... numContainers=1539 #asks=6
11/06/17 06:11:18 ERROR rm.RMContainerAllocator: ERROR IN CONTACTING RM.
java.lang.NullPointerException
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decResourceRequest(RMContainerRequestor.java:246)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor.decContainerReq(RMContainerRequestor.java:198)
at
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.assign(RMContainerAllocator.java:523)
at
org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator$ScheduledRequests.access$200(RMContainerAllocator.java:433)
at org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator.heartbeat(RMContainerAllocator.java:151)
at org.apache.hadoop.mapreduce.v2.app.rm.RMCommunicator$1.run(RMCommunicator.java:220)
at java.lang.Thread.run(Thread.java:619)
Attachments
Attachments
Issue Links
- duplicates
-
MAPREDUCE-3234 Locality scheduling broken due to mismatch between IPs and hosts
- Resolved