Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version/s: 2.1.0
Fix Version/s: None
Description
Occasionally a job gets stuck waiting for its last stage to start.
There are no Tasks waiting for completion, and the job just hangs.
There are available Executors for the job to run.
I do not know how to reproduce this. All I know is that it happens randomly after a couple of days of heavy load.
One thing that might help: it seems to happen when some tasks fail because one or more executors are killed (due to memory issues or something similar).
Those tasks eventually do get finished by other executors because of retries, but the next stage hangs.