Uploaded image for project: 'Flink'
  1. Flink
  2. FLINK-4911

Non-disruptive JobManager Failures via Reconciliation

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Closed
    • Major
    • Resolution: Later
    • None
    • None
    • Runtime / Coordination
    • None

    Description

      JobManager failures can be handled in a non-disruptive way - by reconciling the new JobManager leader and the TaskManagers.
      I suggest to use this term (reconcile) - it has been uses also by other frameworks (like Mesos) for non-disruptive handling of failures.

      The basic approach is the following:

      • When a JobManager fails, TaskManagers do not cancel tasks, but attempt to reconnect to the JobManager
      • On connect, the TaskManager tells the JobManager about its currently running tasks
      • A new JobManager waits for TaskManagers to connect and report a task status. It re-constructs the ExecutionGraph state from these reports
      • Tasks whose status was not reconstructed in a certain time are assumed failed and trigger regular task recovery.

      To avoid having to re-implement this for the new JobManager / TaskManager approach in flip-6, I suggest to directly implement this into the flip-6 feature branch.

      Attachments

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            zjwang Zhijiang
            sewen Stephan Ewen
            Votes:
            0 Vote for this issue
            Watchers:
            13 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 10m
                10m

                Slack

                  Issue deployment