Flink / FLINK-24086

Do not re-register SharedStateRegistry to reduce the recovery time of the job


Details

    Description

      At present, we only recover the CompletedCheckpointStore when the JobManager starts, so it seems that we do not need to re-register the SharedStateRegistry when the task restarts.

      The motivation comes from our production environment. There, we discard part of the data and state so that only the failed task is restarted, but we found that re-registering the SharedStateRegistry can take several seconds (with thousands of tasks and dozens of TB of state). When a large number of tasks fail at the same time, this can add up to several minutes (number of failed tasks × several seconds).
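As a rough illustration of that estimate, the sketch below multiplies the number of failed tasks by a per-restart registration time. It assumes the restarts are handled one after another, and the numbers are hypothetical placeholders, not measurements from this issue.

```python
def total_reregistration_seconds(failed_tasks: int, seconds_per_registration: float) -> float:
    """Serial cost model: every task restart re-registers the shared state once."""
    return failed_tasks * seconds_per_registration


# Hypothetical inputs: 60 task failures, 3 seconds per re-registration.
estimate = total_reregistration_seconds(failed_tasks=60, seconds_per_registration=3.0)
print(f"{estimate / 60:.1f} minutes spent re-registering")
```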

      Therefore, if the SharedStateRegistry can be reused, the time for task recovery can be reduced.
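The idea can be sketched with a toy model. `ToyRegistry` and `recover` are hypothetical names used only for illustration and do not mirror Flink's actual `SharedStateRegistry` API. The sketch assumes the set of completed checkpoints does not change between task restarts, which is the premise of this issue.

```python
from collections import Counter
from typing import Iterable, List, Tuple


class ToyRegistry:
    """Toy stand-in for a shared state registry: reference counts per state handle."""

    def __init__(self) -> None:
        self.ref_counts: Counter = Counter()
        self.registrations = 0  # number of handle registrations performed

    def register_all(self, checkpoints: Iterable[List[str]]) -> None:
        for handles in checkpoints:
            for handle in handles:
                self.ref_counts[handle] += 1
                self.registrations += 1


def recover(checkpoints: List[List[str]], restarts: int, reuse: bool) -> Tuple[ToyRegistry, int]:
    """Build the registry once at startup, then simulate `restarts` task restarts."""
    registry = ToyRegistry()
    registry.register_all(checkpoints)  # done once, at "JobManager start"
    work = registry.registrations
    for _ in range(restarts):
        if not reuse:
            # Described current behaviour: rebuild the registry on every restart.
            registry = ToyRegistry()
            registry.register_all(checkpoints)
            work += registry.registrations
        # With reuse=True the existing registry is kept and no work is added.
    return registry, work


# Two hypothetical checkpoints that share the handle "sst-2".
CHECKPOINTS = [["sst-1", "sst-2"], ["sst-2", "sst-3"]]

rebuilt, rebuild_work = recover(CHECKPOINTS, restarts=5, reuse=False)
reused, reuse_work = recover(CHECKPOINTS, restarts=5, reuse=True)
print(rebuild_work, reuse_work)
```

In this model both variants end with identical reference counts, so the only difference is the amount of registration work repeated on each restart.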


          People

            Roman Khachatryan (roman)
            Ming Li
            Votes: 0
            Watchers: 8
