[FLINK-11843] Dispatcher fails to recover jobs if leader change happens during JobManagerRunner termination - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.7.2, 1.8.0, 1.9.0
Fix Version/s: 1.10.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

The Dispatcher fails to recover jobs if a leader change happens while the JobManagerRunner of the previous run is still terminating. The problem is that we schedule the start future of the recovered JobGraph using the MainThreadExecutor and additionally require that this future completes before any other recovery operation from a subsequent leadership session is executed. If the leadership changes in the meantime, the MainThreadExecutor is invalidated and the scheduled future never completes, which blocks every later recovery operation that waits on it.
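The failure mode described above can be reproduced in isolation with plain `CompletableFuture`s. The following is a minimal sketch, not Flink's actual code: `SessionExecutor`, `recoveryCompletes`, and the variable names are hypothetical stand-ins, and the sketch assumes that an invalidated executor silently drops the tasks handed to it.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

class LeaderChangeSketch {

    /** Hypothetical stand-in for a main-thread executor bound to one leadership session. */
    static final class SessionExecutor implements Executor {
        private final Queue<Runnable> queue = new ArrayDeque<>();
        private boolean valid = true;

        @Override
        public void execute(Runnable command) {
            queue.add(command);
        }

        /** Assumption: after invalidation, queued tasks are dropped instead of run. */
        void invalidate() {
            valid = false;
        }

        /** Processes everything queued so far on the calling thread. */
        void drain() {
            Runnable task;
            while ((task = queue.poll()) != null) {
                if (valid) {
                    task.run();
                }
            }
        }
    }

    /** Returns whether the next session's recovery step gets to run. */
    public static boolean recoveryCompletes(boolean leaderChangeDuringTermination) {
        SessionExecutor sessionOne = new SessionExecutor();
        CompletableFuture<Void> previousRunnerTermination = new CompletableFuture<>();

        // Session 1: start the recovered job once the previous runner has terminated,
        // scheduled on the executor of the current leadership session.
        CompletableFuture<Void> jobStart =
                previousRunnerTermination.thenRunAsync(() -> { }, sessionOne);

        if (leaderChangeDuringTermination) {
            sessionOne.invalidate(); // leadership is lost while termination is pending
        }

        previousRunnerTermination.complete(null);
        sessionOne.drain();

        // Session 2: recovery is only allowed to proceed after jobStart completes.
        CompletableFuture<Void> nextRecovery = jobStart.thenRun(() -> { });
        return nextRecovery.isDone();
    }

    public static void main(String[] args) {
        System.out.println(recoveryCompletes(false)); // prints "true"
        System.out.println(recoveryCompletes(true));  // prints "false"
    }
}
```

In the second call the start future is handed to an executor that no longer runs anything, so the future that the next leadership session waits on stays incomplete forever.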

Relevant mailing list thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-7-1-job-stuck-in-suspended-state-td26439.html

Issue Links

is depended upon by:
- FLINK-11665 Flink fails to remove JobGraph from ZK even though it reports it did (Closed)

is duplicated by:
- FLINK-12048 ZooKeeperHADispatcherTest failed on Travis (Closed)

links to:
- GitHub Pull Request #9832

People

Assignee: Till Rohrmann

Reporter: Till Rohrmann

Votes: 2

Watchers: 14

Dates

Created: 06/Mar/19 15:32

Updated: 19/May/20 14:29

Resolved: 25/Oct/19 15:38

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m