[FLINK-25486] Perjob can not recover from checkpoint when zookeeper leader changes - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.13.5, 1.14.2, 1.15.0
Fix Version/s: 1.13.6, 1.14.4, 1.15.0
Component/s: Runtime / Coordination
Labels:
- pull-request-available

Description

When high-availability.zookeeper.client.tolerate-suspended-connections is left at its default value of false, the appMaster fails over as soon as the ZooKeeper leader changes. In this case, the old appMaster cleans up all of its ZooKeeper data, so the new appMaster cannot recover from the latest checkpoint.

The process is as follows:

1. Start a per-job application.
2. Kill the ZooKeeper leader node, which causes the job to suspend.
3. In MiniDispatcher's jobReachedTerminalState method, shutDownFuture is set to UNKNOWN.
4. The future is passed on to ClusterEntrypoint, where the resulting shutdown call is made with cleanupHaData set to true.
5. The ZooKeeper data is cleaned up and the process exits.
6. The new appMaster finds no checkpoint to start from, and the state is lost.

Since the job can recover automatically after a ZooKeeper leader change, it is reasonable to keep the ZooKeeper data for the upcoming recovery.
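
The reasoning above can be sketched in a few lines of Java. This is only an illustration of the intended decision, not the code from the pull request: HaCleanupSketch, JobOutcome, cleanupHaDataReported and cleanupHaDataProposed are hypothetical names, and the sketch assumes that a suspended job is the only outcome from which a new appMaster should still be able to recover.

```java
// Hypothetical sketch; these types are illustrative and are not Flink's classes.
final class HaCleanupSketch {

    /** Simplified job outcomes; SUSPENDED models a job that stopped but can still be recovered. */
    enum JobOutcome {
        FINISHED(true),
        FAILED(true),
        CANCELED(true),
        SUSPENDED(false);

        private final boolean globallyTerminal;

        JobOutcome(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminal() {
            return globallyTerminal;
        }
    }

    /** Behaviour described in the report: HA data is cleaned up whatever the outcome is. */
    static boolean cleanupHaDataReported(JobOutcome outcome) {
        return true;
    }

    /** Proposed behaviour: keep HA data whenever the job may still be recovered. */
    static boolean cleanupHaDataProposed(JobOutcome outcome) {
        return outcome.isGloballyTerminal();
    }

    public static void main(String[] args) {
        // Print both decisions side by side for every outcome.
        for (JobOutcome outcome : JobOutcome.values()) {
            System.out.println(
                    outcome
                            + ": reported cleanup=" + cleanupHaDataReported(outcome)
                            + ", proposed cleanup=" + cleanupHaDataProposed(outcome));
        }
    }
}
```

Under these assumptions, the two variants differ only for the suspended case: the reported behaviour removes the checkpoint pointers that the next appMaster would need, while the proposed one leaves them in place.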

Issue Links

links to

GitHub Pull Request #18296

People

Assignee: Liu

Reporter: Liu

Votes: 0

Watchers: 7

Dates

Created: 30/Dec/21 11:20

Updated: 30/Jan/22 16:23

Resolved: 29/Jan/22 15:45