[FLINK-18451] Flink HA on yarn may appear TaskManager double running when HA is restored - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Abandoned
Affects Version/s: 1.9.0
Fix Version/s: None
Component/s: Deployment / YARN
Labels:
- high-availability

Description

We found that when a NodeManager is lost, YARN's ResourceManager restarts the JobManager, and the new JobManager registers itself as leader in ZooKeeper. The original TaskManagers discover the new JobManager through ZooKeeper and close their connection to the old JobManager. At this point, all tasks on those TaskManagers fail. The new JobManager then recovers the job directly from the latest checkpoint.

However, if a TaskManager's connection to ZooKeeper is faulty during recovery, it does not register with the new JobManager in time. Until one of the following timeouts expires:
1. The ZooKeeper connection timeout
2. The heartbeat timeout with the JobManager/ResourceManager
its tasks keep running (assuming the tasks can run independently within the TaskManager). If HA recovery is fast enough, some tasks will then be running twice at the same time.
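
To make the timing concrete, here is a minimal sketch of the overlap window described above. It is not Flink code: the class `DoubleRunWindow` and its parameter names are hypothetical. It assumes a stale TaskManager stops its tasks as soon as the earlier of the two timeouts fires, and that the new JobManager redeploys tasks once recovery completes.

```java
import java.time.Duration;

// Hypothetical model, not part of Flink: estimates how long old and new
// task instances could run side by side after an HA failover.
final class DoubleRunWindow {

    // Assumption: the stale TaskManager cancels its tasks when the earlier
    // of the ZooKeeper timeout and the heartbeat timeout expires.
    static Duration overlap(Duration zkTimeout,
                            Duration heartbeatTimeout,
                            Duration recoveryTime) {
        Duration detection = zkTimeout.compareTo(heartbeatTimeout) <= 0
                ? zkTimeout
                : heartbeatTimeout;
        // If recovery takes longer than detection, there is no overlap.
        Duration gap = detection.minus(recoveryTime);
        return gap.isNegative() ? Duration.ZERO : gap;
    }
}
```

Under these assumptions, the double run only occurs when recovery completes before the earlier timeout, which matches the "HA recovers fast enough" condition in the description.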

Should we keep a persistent record of the cluster resources allocated at runtime, and use it to confirm that all tasks have stopped when HA is restored?
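
One possible shape of such a record, as a rough sketch only: the class `AllocationLedger` and its methods are hypothetical and do not exist in Flink. It assumes the old JobManager persisted the IDs of the containers it had allocated, and that the new JobManager can learn, for each container, that it has either re-registered or been released.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not a Flink API: tracks containers allocated before
// the failover that the new JobManager has not yet accounted for.
final class AllocationLedger {

    private final Set<String> outstanding = new HashSet<>();

    // persistedContainerIds: IDs assumed to be loaded from durable storage.
    AllocationLedger(Set<String> persistedContainerIds) {
        outstanding.addAll(persistedContainerIds);
    }

    // Call when a container has re-registered or is confirmed released.
    void markAccountedFor(String containerId) {
        outstanding.remove(containerId);
    }

    // Redeploying is only safe once no old container can still run tasks.
    boolean safeToRedeploy() {
        return outstanding.isEmpty();
    }

    // Returns a copy so callers cannot modify the ledger's state.
    Set<String> unaccounted() {
        return new HashSet<>(outstanding);
    }
}
```

With a ledger like this, the new JobManager would delay redeployment until `safeToRedeploy()` returns true, trading recovery latency for the guarantee that no stale task is still running.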

Issue Links

is related to

FLINK-18677 ZooKeeperLeaderRetrievalService does not invalidate leader in case of SUSPENDED connection

Closed

People

Assignee: Unassigned

Reporter: Ming Li

Votes: 0

Watchers: 4

Dates

Created: 30/Jun/20 03:08

Updated: 12/Feb/21 10:57

Resolved: 12/Feb/21 10:57