Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Description
At present, we only recover the CompletedCheckpointStore when the JobManager starts, so it seems that we do not need to re-register the SharedStateRegistry when the task restarts.
This issue comes from our production environment, where we discard part of the data and state so that only the failed task is restarted. We found that re-registering the SharedStateRegistry can take several seconds (thousands of tasks and dozens of TB of state). When a large number of tasks fail at the same time, the total can reach several minutes (number of tasks * several seconds).
Therefore, if the SharedStateRegistry can be reused, the time for task recovery can be reduced.
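To illustrate the cost difference, here is a minimal toy model, not Flink code. ToySharedStateRegistry, RegistryReuseSketch, and the "sst-*" keys are hypothetical names invented for this sketch, and it assumes registration cost grows with the number of register calls. It compares rebuilding the registry on every task restart with building it once and reusing it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a shared state registry. This is NOT Flink's
// SharedStateRegistry API; it only counts how much registration work is done.
final class ToySharedStateRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private long registrations = 0;

    // Register one reference to a shared state entry and count the work.
    void register(String stateKey) {
        refCounts.merge(stateKey, 1, Integer::sum);
        registrations++;
    }

    long registrations() {
        return registrations;
    }

    int distinctEntries() {
        return refCounts.size();
    }
}

final class RegistryReuseSketch {
    // Each inner list holds the shared-state keys referenced by one completed checkpoint.
    static ToySharedStateRegistry buildRegistry(List<List<String>> checkpoints) {
        ToySharedStateRegistry registry = new ToySharedStateRegistry();
        for (List<String> checkpoint : checkpoints) {
            for (String key : checkpoint) {
                registry.register(key);
            }
        }
        return registry;
    }

    // Strategy A: rebuild the registry on every task restart.
    // Returns the total number of register calls across all restarts.
    static long rebuildOnEveryRestart(List<List<String>> checkpoints, int restarts) {
        long total = 0;
        for (int i = 0; i < restarts; i++) {
            total += buildRegistry(checkpoints).registrations();
        }
        return total;
    }

    // Strategy B: build the registry once and only read from it on each restart.
    // Returns the total number of register calls, which does not depend on restarts.
    static long reuseAcrossRestarts(List<List<String>> checkpoints, int restarts) {
        ToySharedStateRegistry registry = buildRegistry(checkpoints);
        for (int i = 0; i < restarts; i++) {
            registry.distinctEntries(); // a restart only reads the existing registry
        }
        return registry.registrations();
    }

    public static void main(String[] args) {
        List<List<String>> checkpoints =
                List.of(List.of("sst-1", "sst-2"), List.of("sst-2", "sst-3"));
        System.out.println("rebuild: " + rebuildOnEveryRestart(checkpoints, 100));
        System.out.println("reuse:   " + reuseAcrossRestarts(checkpoints, 100));
    }
}
```

In this toy model the rebuild strategy does work proportional to restarts * registered entries, while the reuse strategy does the registration work once. That is the saving the proposal is after. How the real registry would be kept consistent across restarts is outside the scope of this sketch.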
Issue Links
- blocks: FLINK-24611 Prevent JM from discarding state on checkpoint abortion (Resolved)
- is related to: FLINK-22483 Recover checkpoints when JobMaster gains leadership (Closed)
- links to