Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.6.0
Labels: None
Description
We encountered this problem while running a Spark cluster. The amResourceUsage for a queue became artificially high, and the cluster then deadlocked: the maxAMShare constraint kicked in and no new AM could be admitted to the cluster.
I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289
In summary, the condition for adding a container's memory to amResourceUsage is fragile: it depends on the number of live containers belonging to the app. We saw the Spark AM go down without explicitly releasing its requested containers, after which the memory of one of those containers was counted towards amResourceUsage.
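To make the failure mode concrete, here is a minimal, self-contained sketch of the accounting problem. It is a simplified model, not the actual FairScheduler code: the class AmShareSketch, its methods, the "exactly one live container" check, the admission formula, and the guard flag are all assumptions made for illustration, and the guarded variant is not necessarily how the issue was fixed.

```java
/**
 * Simplified, hypothetical model of per-queue AM resource accounting.
 * All names here are illustrative; this is not the real scheduler API.
 */
final class AmShareSketch {
    private final boolean guarded;   // true: count the AM only once per app attempt
    private int amResourceUsageMb;   // memory the queue attributes to AMs
    private int liveContainers;      // live containers of the single modelled app
    private boolean amCounted;       // set once the AM container has been counted

    AmShareSketch(boolean guarded) {
        this.guarded = guarded;
    }

    /** A container is allocated to the app. */
    void allocate(int memoryMb) {
        liveContainers++;
        // Fragile rule (assumed): "the only live container must be the AM".
        // Guarded rule (assumed): "only the first container ever allocated is the AM".
        boolean treatAsAm = guarded ? !amCounted : liveContainers == 1;
        if (treatAsAm) {
            amResourceUsageMb += memoryMb;
            amCounted = true;
        }
    }

    /** The AM container exits; its memory is removed from the AM usage. */
    void amExited(int amMemoryMb) {
        liveContainers--;
        amResourceUsageMb -= amMemoryMb;
    }

    /** Simplified admission check: AM usage must stay within maxAmShare of the queue share. */
    boolean canAdmitAm(int amMemoryMb, int queueShareMb, double maxAmShare) {
        return amResourceUsageMb + amMemoryMb <= maxAmShare * queueShareMb;
    }

    int amResourceUsageMb() {
        return amResourceUsageMb;
    }

    /** Replays the reported sequence: AM starts, AM dies, a pending request is then granted. */
    static AmShareSketch simulate(boolean guarded) {
        AmShareSketch queue = new AmShareSketch(guarded);
        queue.allocate(1024);   // AM container
        queue.amExited(1024);   // AM goes down without releasing its outstanding requests
        queue.allocate(4096);   // a leftover executor request is satisfied afterwards
        return queue;
    }

    public static void main(String[] args) {
        for (boolean guarded : new boolean[] {false, true}) {
            AmShareSketch queue = simulate(guarded);
            System.out.println((guarded ? "guarded" : "fragile")
                + ": amResourceUsage=" + queue.amResourceUsageMb() + " MB, new 1024 MB AM admitted="
                + queue.canAdmitAm(1024, 8192, 0.5));
        }
    }
}
```

With the fragile rule, the 4096 MB executor container becomes the only live container after the AM exits, so it is counted as AM memory. In this model, with an illustrative queue share of 8192 MB and maxAMShare of 0.5, that is enough to block any further AM from being admitted, which matches the deadlock described above.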
cc - sandyr