[FLINK-19596] Do not recover CompletedCheckpointStore on each failover - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Critical
Resolution: Duplicate
Affects Version/s: 1.11.2
Fix Version/s: None
Component/s: Runtime / Checkpointing
Labels: None

Description

completedCheckpointStore.recover() in restoreLatestCheckpointedStateInternal can become a bottleneck on failover, because the CompletedCheckpointStore has to load files from HDFS to instantiate the CompletedCheckpoint instances.

The impact is significant in our use case:

Jobs with high parallelism (no shuffle) that transfer data from Kafka to other filesystems.
If a machine goes down, several containers and tens of tasks are affected, which means completedCheckpointStore.recover() is called tens of times, because the affected tasks are not in the same failover region.

I also noticed a "TODO" in the source code:

// Recover the checkpoints, TODO this could be done only when there is a new leader, not on each recovery
completedCheckpointStore.recover();
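
The direction the TODO hints at can be sketched as follows. This is a minimal, self-contained illustration and not Flink's actual code: HypotheticalCheckpointStore, HypotheticalRestorer, and the onLeadershipGranted/onLeadershipRevoked hooks are names invented for this example, and the real store interface and leadership callbacks may look different. The sketch only shows the idea of paying the expensive recover() cost once per leadership grant rather than once per task restore.

```java
import java.util.ArrayList;
import java.util.List;

/** Demo entry point: one leadership grant followed by many task failovers. */
class RecoverOnceSketch {
    public static void main(String[] args) {
        HypotheticalCheckpointStore store = new HypotheticalCheckpointStore();
        HypotheticalRestorer restorer = new HypotheticalRestorer(store);

        restorer.onLeadershipGranted();
        for (int i = 0; i < 30; i++) {
            // One restore per failed task; none of them should hit remote storage again.
            restorer.restoreLatestCheckpointedState();
        }
        System.out.println("recover() calls: " + store.getRecoverCalls());
    }
}

/** Hypothetical stand-in for a completed checkpoint store (not Flink's interface). */
class HypotheticalCheckpointStore {
    private final List<String> checkpoints = new ArrayList<>();
    private int recoverCalls = 0;

    /** Simulates the expensive step of loading checkpoint metadata from remote storage. */
    void recover() {
        recoverCalls++;
        checkpoints.clear();
        checkpoints.add("chk-1"); // placeholder for metadata read from remote files
    }

    /** Returns the most recent checkpoint held in memory, or null if none is loaded. */
    String getLatest() {
        return checkpoints.isEmpty() ? null : checkpoints.get(checkpoints.size() - 1);
    }

    int getRecoverCalls() {
        return recoverCalls;
    }
}

/** Hypothetical restore path that recovers the store once per leadership term. */
class HypotheticalRestorer {
    private final HypotheticalCheckpointStore store;
    private boolean recovered = false;

    HypotheticalRestorer(HypotheticalCheckpointStore store) {
        this.store = store;
    }

    /** Assumed hook: called when this process becomes the leader. */
    void onLeadershipGranted() {
        store.recover();
        recovered = true;
    }

    /** Assumed hook: the in-memory view can no longer be trusted after losing leadership. */
    void onLeadershipRevoked() {
        recovered = false;
    }

    /** Called on every task failover; reuses the in-memory state when it is still valid. */
    String restoreLatestCheckpointedState() {
        if (!recovered) {
            // Fallback for the case where no leadership hook ran before the first restore.
            store.recover();
            recovered = true;
        }
        return store.getLatest();
    }
}
```

With 30 simulated task failovers after a single leadership grant, the store's recover() runs once instead of 30 times. The sketch ignores concurrency and assumes that the in-memory view stays valid for the whole leadership term, which a real implementation would need to verify.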

Issue Links

causes

FLINK-19401 Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

Resolved

duplicates

FLINK-19401 Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster

Resolved

FLINK-6984 Do not recover CompletedCheckpointStore on every restore

Closed

relates to

FLINK-22483 Recover checkpoints when JobMaster gains leadership

Closed

People

Assignee: Unassigned

Reporter: Jiayi Liao

Votes: 0

Watchers: 9

Dates

Created: 13/Oct/20 05:29

Updated: 05/Aug/21 13:24

Resolved: 05/Aug/21 13:24