Flink / FLINK-4911

Non-disruptive JobManager Failures via Reconciliation

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Runtime / Coordination
    • Labels: None

    Description

      JobManager failures can be handled non-disruptively by reconciling the new JobManager leader with the TaskManagers.
      I suggest using this term (reconcile); other frameworks (such as Mesos) already use it for non-disruptive handling of failures.

      The basic approach is the following:

      • When a JobManager fails, TaskManagers do not cancel their tasks, but attempt to reconnect to the new JobManager leader.
      • On connect, each TaskManager tells the JobManager about its currently running tasks.
      • A new JobManager waits for TaskManagers to connect and report their task status. It reconstructs the ExecutionGraph state from these reports.
      • Tasks whose status is not reconstructed within a certain time are assumed to have failed, which triggers regular task recovery for them.
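
      The steps above could be sketched roughly as follows. This is only an illustration of the idea, not existing Flink code: the class ReconciliationSketch, its methods, and the simplified TaskState enum are all hypothetical, and the real ExecutionGraph tracks far more state. Time is passed in explicitly so the timeout behaviour is easy to follow.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch of leader-side reconciliation; not Flink's actual API. */
final class ReconciliationSketch {

    /** Simplified task states (assumption; the real ExecutionGraph has more). */
    enum TaskState { UNKNOWN, RUNNING, FAILED }

    private final Map<String, TaskState> tasks = new HashMap<>();
    private final long deadlineMillis;

    /** A new leader starts with every expected task in the UNKNOWN state. */
    ReconciliationSketch(Set<String> expectedTaskIds, long nowMillis, long timeoutMillis) {
        for (String id : expectedTaskIds) {
            tasks.put(id, TaskState.UNKNOWN);
        }
        this.deadlineMillis = nowMillis + timeoutMillis;
    }

    /** Called when a TaskManager reconnects and reports its running tasks. */
    void onTaskManagerReport(Collection<String> runningTaskIds, long nowMillis) {
        if (nowMillis > deadlineMillis) {
            return; // report arrived after the reconciliation window closed
        }
        for (String id : runningTaskIds) {
            // Only tasks that belong to this job and are still UNKNOWN are updated.
            if (tasks.get(id) == TaskState.UNKNOWN) {
                tasks.put(id, TaskState.RUNNING);
            }
        }
    }

    /**
     * Once the deadline has passed, marks every unreported task as FAILED and
     * returns those task ids so that regular task recovery can be triggered.
     * Before the deadline it returns an empty set and changes nothing.
     */
    Set<String> expire(long nowMillis) {
        Set<String> toRecover = new TreeSet<>();
        if (nowMillis < deadlineMillis) {
            return toRecover;
        }
        for (Map.Entry<String, TaskState> entry : tasks.entrySet()) {
            if (entry.getValue() == TaskState.UNKNOWN) {
                entry.setValue(TaskState.FAILED);
                toRecover.add(entry.getKey());
            }
        }
        return toRecover;
    }

    /** Returns the reconstructed state, or null for a task this job does not know. */
    TaskState stateOf(String taskId) {
        return tasks.get(taskId);
    }
}
```

      In this sketch, a leader expecting tasks map-1, map-2 and sink-1 that only hears about map-1 and sink-1 before the deadline would hand map-2 to regular recovery, while the two reported tasks keep running undisturbed.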

      To avoid having to re-implement this for the new JobManager / TaskManager approach in FLIP-6, I suggest implementing it directly in the FLIP-6 feature branch.

            People

              Assignee: Zhijiang (zjwang)
              Reporter: Stephan Ewen (sewen)
              Votes: 0
              Watchers: 13

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 10m