  1. Flink
  2. FLINK-24086

Do not re-register SharedStateRegistry to reduce the recovery time of the job


Details

    Description

      Currently, the CompletedCheckpointStore is recovered only when the JobManager starts, so re-registering the SharedStateRegistry every time a task restarts appears to be unnecessary.

      This came up in our production environment, where we restart only the failed task and accept discarding part of the data and state. With thousands of tasks and dozens of TB of state, registering the SharedStateRegistry can take several seconds per restart. When a large number of tasks fail at the same time, the total can reach several minutes (number of failed tasks * several seconds).

      Therefore, if the SharedStateRegistry can be reused across task restarts, task recovery time can be reduced.
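      To illustrate the cost difference described above, here is a minimal sketch in Java. It is a toy model, not Flink's actual SharedStateRegistry: the class names `ToySharedStateRegistry` and `RegistryReuseSketch`, the method names, and the handle keys are all hypothetical. It only assumes that registration walks every shared state handle of every retained checkpoint and bumps a reference count, and it counts how many registration calls each strategy performs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy reference-counting registry (hypothetical; not Flink's implementation). */
final class ToySharedStateRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private long registrations = 0;

    /** Registers each shared handle key of one checkpoint, bumping its reference count. */
    void registerAll(List<String> sharedHandleKeys) {
        for (String key : sharedHandleKeys) {
            refCounts.merge(key, 1, Integer::sum);
            registrations++;
        }
    }

    long registrations() {
        return registrations;
    }

    int refCount(String key) {
        return refCounts.getOrDefault(key, 0);
    }
}

final class RegistryReuseSketch {
    /** Rebuild strategy: every restart creates a new registry and re-registers all checkpoints. */
    static long rebuildOnEveryRestart(List<List<String>> checkpoints, int restarts) {
        long total = 0;
        for (int i = 0; i < restarts; i++) {
            ToySharedStateRegistry fresh = new ToySharedStateRegistry();
            for (List<String> checkpoint : checkpoints) {
                fresh.registerAll(checkpoint);
            }
            total += fresh.registrations();
        }
        return total;
    }

    /** Reuse strategy: the registry is populated once; restarts add no registration work. */
    static long reuseAcrossRestarts(List<List<String>> checkpoints, int restarts) {
        ToySharedStateRegistry registry = new ToySharedStateRegistry();
        for (List<String> checkpoint : checkpoints) {
            registry.registerAll(checkpoint);
        }
        // 'restarts' is intentionally unused: the existing registry is kept as-is.
        return registry.registrations();
    }

    public static void main(String[] args) {
        // Two retained checkpoints that share one handle ("sst-2"); keys are made up.
        List<List<String>> checkpoints = List.of(
                List.of("sst-1", "sst-2"),
                List.of("sst-2", "sst-3"));
        System.out.println(rebuildOnEveryRestart(checkpoints, 3)); // prints 12
        System.out.println(reuseAcrossRestarts(checkpoints, 3));   // prints 4
    }
}
```

      In this model the rebuild strategy scales with restarts * total handles, while the reuse strategy pays the registration cost once. Whether reuse is safe in Flink depends on the registry staying consistent with the CompletedCheckpointStore, which this sketch does not model.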


            People

              Roman Khachatryan
              Ming Li
              Votes: 0
              Watchers: 8
