Flink / FLINK-4343 Implement new TaskManager / FLINK-6160

Retry JobManager/ResourceManager connection in case of timeout

    Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.3.0, 1.5.0, 1.6.0
    • Fix Version/s: 1.5.0

      Description

      When a heartbeat times out, the TaskExecutor closes its connection to the remote component (JobManager or ResourceManager). It also assumes that the component has actually failed, so it only tries to reconnect once it is notified of a new leader address and leader session id. This is brittle, because a heartbeat can time out without the component having crashed. We should therefore automatically retry connecting to the last known leader address when a timeout occurs.
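
      The proposed behavior can be illustrated with a minimal sketch. It is not Flink's TaskExecutor code: the class `LeaderReconnectSketch`, its methods, and the address string are hypothetical names chosen for illustration. The idea is only that the last known leader address and session id are kept across a heartbeat timeout and reused for a reconnect attempt, instead of waiting for a new leader notification.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; all names here are assumptions, not Flink APIs. */
final class LeaderReconnectSketch {

    private String lastLeaderAddress;   // null until the first leader notification
    private String lastLeaderSessionId; // kept together with the address
    private boolean connected;

    /** Records every connection attempt so the behavior can be inspected. */
    final List<String> connectAttempts = new ArrayList<>();

    /** Called when leader retrieval announces a (new) leader. */
    void notifyNewLeader(String address, String sessionId) {
        lastLeaderAddress = address;
        lastLeaderSessionId = sessionId;
        connect();
    }

    /** Called when the heartbeat to the remote component times out. */
    void onHeartbeatTimeout() {
        // Close the connection, but keep the last known leader information.
        connected = false;
        // The component may still be alive, so retry against the same leader
        // instead of waiting for a new leader notification.
        if (lastLeaderAddress != null) {
            connect();
        }
    }

    boolean isConnected() {
        return connected;
    }

    private void connect() {
        // A real implementation would start an asynchronous registration here.
        connectAttempts.add(lastLeaderAddress + "#" + lastLeaderSessionId);
        connected = true;
    }

    public static void main(String[] args) {
        LeaderReconnectSketch taskExecutor = new LeaderReconnectSketch();
        taskExecutor.notifyNewLeader("jobmanager-host:6123", "session-1");
        taskExecutor.onHeartbeatTimeout();
        System.out.println(taskExecutor.connectAttempts);
    }
}
```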

      Acceptance criteria:

      • The registration should be retried until a time limit expires, after which the TaskExecutor terminates.
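
      The acceptance criterion could be sketched as a retry loop bounded by a deadline. This is an illustration under stated assumptions, not the actual fix: `RegistrationRetrySketch`, `retryUntilDeadline`, and the backoff parameters are hypothetical, and the doubling backoff is a design choice made here rather than something the issue specifies. On `TIMED_OUT`, the caller would be expected to terminate the TaskExecutor.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; not Flink's actual registration code. */
final class RegistrationRetrySketch {

    enum Outcome { REGISTERED, TIMED_OUT }

    /**
     * Calls {@code attempt} until it returns true or the time limit has expired.
     * At least one attempt is always made. The pause between attempts doubles
     * each time, capped at {@code maxBackoff} (an assumption of this sketch).
     */
    static Outcome retryUntilDeadline(
            BooleanSupplier attempt,
            Duration maxRegistrationDuration,
            Duration initialBackoff,
            Duration maxBackoff) {
        final long startNanos = System.nanoTime();
        final long limitNanos = maxRegistrationDuration.toNanos();
        final long maxBackoffMillis = Math.max(1L, maxBackoff.toMillis());
        long backoffMillis = Math.min(Math.max(1L, initialBackoff.toMillis()), maxBackoffMillis);

        while (true) {
            if (attempt.getAsBoolean()) {
                return Outcome.REGISTERED;
            }
            long elapsedNanos = System.nanoTime() - startNanos;
            long remainingMillis = (limitNanos - elapsedNanos) / 1_000_000L;
            if (remainingMillis <= 0) {
                // Time limit expired: the caller should terminate the TaskExecutor.
                return Outcome.TIMED_OUT;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(backoffMillis, remainingMillis));
            } catch (InterruptedException e) {
                // This sketch treats interruption as giving up.
                Thread.currentThread().interrupt();
                return Outcome.TIMED_OUT;
            }
            backoffMillis = Math.min(backoffMillis * 2, maxBackoffMillis);
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated registration that succeeds on the third attempt.
        Outcome outcome = retryUntilDeadline(
                () -> ++calls[0] >= 3,
                Duration.ofSeconds(5),
                Duration.ofMillis(1),
                Duration.ofMillis(10));
        System.out.println(outcome + " after " + calls[0] + " attempt(s)");
    }
}
```

      Measuring elapsed time with `System.nanoTime()` rather than wall-clock time keeps the limit unaffected by system clock adjustments.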

            People

            • Assignee: Till Rohrmann (till.rohrmann)
            • Reporter: Till Rohrmann (till.rohrmann)
            • Votes: 0
            • Watchers: 5