Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.3.0
None
None
Description
Saw a case where a job was stuck waiting to get reducers. The capacity scheduler had reserved a container for a reducer on the same node as the application master, but that node could never have enough free memory to run it: node total memory was 8G, the reducer needed 8G, and the AM was using 2G, leaving at most 6G. This particular job had 10 reducers, and it was stuck waiting on the one reserved reducer because the AM memory plus the reserved reducer memory was already over the queue limit.
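The arithmetic behind the stuck reservation can be sketched as follows. This is only an illustration using the numbers from the report, not the capacity scheduler's actual code; the `Node` class and the `reservation_can_ever_fit` helper are hypothetical names. The queue limit is not stated in the report, so it is left out of the sketch.

```python
from dataclasses import dataclass

GB = 1024  # work in MB, so 1G == 1024 MB


@dataclass
class Node:
    """Hypothetical stand-in for a node's memory accounting."""
    total_mb: int
    used_mb: int = 0

    @property
    def available_mb(self) -> int:
        return self.total_mb - self.used_mb


def reservation_can_ever_fit(node: Node, request_mb: int, pinned_mb: int) -> bool:
    """Return True if the request could fit once everything except the
    long-lived (pinned) containers on the node has finished."""
    return request_mb <= node.total_mb - pinned_mb


# Numbers from the report: 8G node, 2G AM already running, 8G reducer.
am_mb = 2 * GB
reducer_mb = 8 * GB
node = Node(total_mb=8 * GB, used_mb=am_mb)

fits_now = reducer_mb <= node.available_mb
# The AM lives for the whole job, so its memory never frees up.
fits_ever = reservation_can_ever_fit(node, reducer_mb, pinned_mb=am_mb)
shortfall_mb = reducer_mb - node.available_mb

print(fits_now, fits_ever, shortfall_mb)
```

Because the AM stays on the node for the life of the job, the reducer is always 2G short there, so the reservation cannot be fulfilled no matter how long the job waits.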