Details
- Type: Improvement
- Status: Closed
- Priority: Critical
- Resolution: Duplicate
- Affects Version/s: 1.11.2
- Fix Version/s: None
- Component/s: None
Description
completedCheckpointStore.recover() in restoreLatestCheckpointedStateInternal can become a bottleneck on failover, because the CompletedCheckpointStore needs to load HDFS files to instantiate the CompletedCheckpoint instances.
The impact is significant in our case:
- Jobs with high parallelism (no shuffle) that transfer data from Kafka to other filesystems.
- If a machine goes down, several containers and tens of tasks are affected, which means completedCheckpointStore.recover() is called tens of times, since the affected tasks belong to separate failover regions.
I also notice there is a TODO about this in the source code:
// Recover the checkpoints, TODO this could be done only when there is a new leader, not on each recovery
completedCheckpointStore.recover();
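For illustration only, here is a minimal sketch (not Flink's actual code) of the direction the TODO points at: pay the recover() cost once per leadership session and skip it on subsequent per-region restores. CheckpointStore, RecoverOnceGuard, ensureRecovered() and resetOnNewLeadership() are hypothetical names introduced for this sketch, not Flink APIs.

// Hypothetical interface standing in for the real store; recover() represents
// the expensive HDFS reads that instantiate CompletedCheckpoint objects.
interface CheckpointStore {
    void recover() throws Exception;
}

// Guard that pays the recover() cost once per leadership session instead of
// on every restore of a failover region.
final class RecoverOnceGuard {
    private final CheckpointStore store;
    private boolean recovered = false;

    RecoverOnceGuard(CheckpointStore store) {
        this.store = store;
    }

    // Invoked from each per-region restore; only the first call touches HDFS.
    synchronized void ensureRecovered() throws Exception {
        if (!recovered) {
            store.recover();   // expensive HDFS reads happen here
            recovered = true;  // marked only after recover() succeeds
        }
    }

    // Invoked when the JobMaster (re)gains leadership, so the next restore
    // re-reads the checkpoint metadata exactly once.
    synchronized void resetOnNewLeadership() {
        recovered = false;
    }
}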
Attachments
Issue Links
- causes
  - FLINK-19401 Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster (Resolved)
- duplicates
  - FLINK-19401 Job stuck in restart loop due to excessive checkpoint recoveries which block the JobMaster (Resolved)
  - FLINK-6984 Do not recover CompletedCheckpointStore on every restore (Closed)
- relates to
  - FLINK-22483 Recover checkpoints when JobMaster gains leadership (Closed)