[FLINK-24086] Do not re-register SharedStateRegistry to reduce the recovery time of the job - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.15.0
Component/s: Runtime / Checkpointing, Runtime / Coordination
Labels:
- pull-request-available

Description

Currently, the CompletedCheckpointStore is recovered only when the JobManager starts, so re-registering the SharedStateRegistry every time a task restarts appears to be unnecessary.

This issue comes from our production environment, where we discard part of the data and state so that only the failed task is restarted. We found that registering the SharedStateRegistry can take several seconds in a job with thousands of tasks and dozens of TB of state. When a large number of tasks fail at the same time, the total can reach several minutes (number of tasks * several seconds).

Therefore, if the SharedStateRegistry can be reused across task restarts, task recovery time can be reduced.
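The difference between rebuilding and reusing the registry can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToySharedStateRegistry`, `ToyCheckpointStore`, and all of their methods are hypothetical names invented for illustration and are not Flink's actual classes or API. Each retained checkpoint is modeled as a list of shared state handle keys, and the cost of a restore is approximated by the number of handle registrations performed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical toy registry: counts how many retained checkpoints reference each shared handle. */
class ToySharedStateRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();

    void register(String handleKey) {
        refCounts.merge(handleKey, 1, Integer::sum);
    }

    int referenceCount(String handleKey) {
        return refCounts.getOrDefault(handleKey, 0);
    }
}

/** Hypothetical toy checkpoint store that owns one registry for its retained checkpoints. */
class ToyCheckpointStore {
    private final List<List<String>> checkpoints = new ArrayList<>();
    private ToySharedStateRegistry registry;
    private int totalRegistrations = 0; // rough proxy for time spent registering

    ToyCheckpointStore(List<List<String>> retainedCheckpoints) {
        checkpoints.addAll(retainedCheckpoints);
        registry = buildRegistry(); // done once, when the store is recovered
    }

    /** Walks every handle of every retained checkpoint; cost grows with total state. */
    private ToySharedStateRegistry buildRegistry() {
        ToySharedStateRegistry fresh = new ToySharedStateRegistry();
        for (List<String> handles : checkpoints) {
            for (String key : handles) {
                fresh.register(key);
                totalRegistrations++;
            }
        }
        return fresh;
    }

    /** Rebuild variant: every task restart re-registers all shared state. */
    ToySharedStateRegistry restoreWithRebuild() {
        registry = buildRegistry();
        return registry;
    }

    /** Reuse variant: a task restart returns the registry that already exists. */
    ToySharedStateRegistry restoreWithReuse() {
        return registry;
    }

    int totalRegistrations() {
        return totalRegistrations;
    }
}

class RegistryReuseDemo {
    public static void main(String[] args) {
        List<List<String>> retained = List.of(List.of("a", "b"), List.of("b", "c"));

        ToyCheckpointStore reuse = new ToyCheckpointStore(retained);
        ToyCheckpointStore rebuild = new ToyCheckpointStore(retained);
        for (int restart = 0; restart < 3; restart++) {
            reuse.restoreWithReuse();
            rebuild.restoreWithRebuild();
        }
        System.out.println("registrations with reuse:   " + reuse.totalRegistrations());
        System.out.println("registrations with rebuild: " + rebuild.totalRegistrations());
    }
}
```

In this model the reuse variant pays the registration cost once, when the store is created, while the rebuild variant pays it again on every restart, which mirrors the "number of tasks * several seconds" estimate above.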

Issue Links

blocks:
- FLINK-24611 Prevent JM from discarding state on checkpoint abortion (Resolved)

is related to:
- FLINK-22483 Recover checkpoints when JobMaster gains leadership (Closed)

links to:
- GitHub Pull Request #17179
- GitHub Pull Request #18001

People

Assignee: Roman Khachatryan
Reporter: Ming Li
Votes: 0
Watchers: 8

Dates

Created: 31/Aug/21 14:17
Updated: 10/Dec/21 12:19
Resolved: 09/Dec/21 16:02