Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- 1.7.2, 1.8.0, 1.9.0
Description
The Dispatcher fails to recover jobs if a leader change happens while the JobManagerRunner of the previous run is still terminating. The problem is that we schedule the start future of the recovered JobGraph on the MainThreadExecutor and additionally require this future to complete before any recovery operation from a subsequent leadership session is executed. If the leadership changes in the meantime, the MainThreadExecutor is invalidated and the scheduled future never completes, so the recovery operations of later leadership sessions are blocked indefinitely.
The relevant mailing list thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-7-1-job-stuck-in-suspended-state-td26439.html
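The failure mode described above can be reproduced in isolation with plain `CompletableFuture`s. The following is only a minimal sketch, not Flink's actual code: `SessionBoundExecutor`, `startFutureCompletes`, and the class name are hypothetical stand-ins, and the assumption is that an invalidated main-thread executor silently drops the tasks handed to it.

```java
import java.util.ArrayDeque;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

class LeaderChangeSketch {

    /** Hypothetical stand-in for a main-thread executor tied to one leadership session. */
    static final class SessionBoundExecutor implements Executor {
        private final AtomicBoolean valid = new AtomicBoolean(true);
        private final ArrayDeque<Runnable> queue = new ArrayDeque<>();

        @Override
        public void execute(Runnable command) {
            queue.add(command);
        }

        /** Models the loss of leadership for this session. */
        void invalidate() {
            valid.set(false);
        }

        /** Runs queued tasks; tasks are dropped (assumption) once the session is invalid. */
        void runPending() {
            Runnable task;
            while ((task = queue.poll()) != null) {
                if (valid.get()) {
                    task.run();
                }
            }
        }
    }

    /** Returns whether the start future completes for the given ordering of events. */
    static boolean startFutureCompletes(boolean leaderChangesDuringTermination) {
        SessionBoundExecutor mainThreadExecutor = new SessionBoundExecutor();

        // Completed once the JobManagerRunner of the previous run has terminated.
        CompletableFuture<Void> previousRunTerminated = new CompletableFuture<>();

        // The start of the recovered job is scheduled on the session-bound executor.
        CompletableFuture<String> startFuture =
                previousRunTerminated.thenApplyAsync(ignored -> "job started", mainThreadExecutor);

        if (leaderChangesDuringTermination) {
            mainThreadExecutor.invalidate();
        }
        previousRunTerminated.complete(null);
        mainThreadExecutor.runPending();

        // Anything chained on startFuture (e.g. a later recovery) waits forever if this is false.
        return startFuture.isDone();
    }

    public static void main(String[] args) {
        System.out.println(startFutureCompletes(false)); // prints true
        System.out.println(startFutureCompletes(true));  // prints false
    }
}
```

In the second case nothing ever completes `startFuture`, neither normally nor exceptionally, which is why any operation that waits on it stalls rather than fails.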
Issue Links
- is depended upon by
  - FLINK-11665 Flink fails to remove JobGraph from ZK even though it reports it did (Closed)
- is duplicated by
  - FLINK-12048 ZooKeeperHADispatcherTest failed on Travis (Closed)
- links to