Details
Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
0.23.0
None
Description
Found this on one of the gridmix runs, again. One of the nodes went bad while the job had three containers running on it. Eventually, the AM marked the tasks as timed out and initiated cleanup of the failed containers via stopContainer(). That call got stuck at the faulty node, so the tasks are stuck in the FAIL_CONTAINER_CLEANUP stage and the job sits there waiting forever.
Thanks to Karams for helping with this.
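To illustrate the failure mode: the hang comes down to a blocking cleanup call with no upper bound on how long it may wait for the faulty node. The sketch below is not the committed fix and uses no Hadoop APIs. It only shows, with plain java.util.concurrent, how a blocking call could be bounded by a timeout so the caller can move on. The class name BoundedStop, the method runWithTimeout, and the Runnable standing in for stopContainer() are all hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedStop {

    /**
     * Hypothetical helper: runs a blocking cleanup call (a stand-in for
     * stopContainer()) on a separate daemon thread and waits at most
     * timeoutMillis for it. Returns true if the call finished in time,
     * false if it timed out or threw.
     */
    public static boolean runWithTimeout(Runnable stopCall, long timeoutMillis)
            throws InterruptedException {
        // Daemon thread so a call that never returns cannot keep the JVM alive.
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "bounded-stop");
            t.setDaemon(true);
            return t;
        });
        Future<?> pending = pool.submit(stopCall);
        try {
            pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // The node did not answer in time: give up on this call so the
            // caller can continue instead of waiting forever.
            pending.cancel(true);
            return false;
        } catch (ExecutionException e) {
            // The call itself failed; treat it as an unsuccessful stop.
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated healthy node: the stop call returns immediately.
        boolean healthy = runWithTimeout(() -> { }, 1000);

        // Simulated faulty node: the stop call blocks far longer than the timeout.
        boolean stuck = runWithTimeout(() -> {
            try {
                Thread.sleep(60_000);
            } catch (InterruptedException ignored) {
                // Interrupted by cancel(true); just return.
            }
        }, 200);

        System.out.println("healthy node: " + healthy + ", stuck node: " + stuck);
    }
}
```

With a bound like this, a caller in the position of the AM could treat a false result as "cleanup attempted, node unresponsive" and let the task leave the cleanup stage instead of blocking the job.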