  1. Flink
  2. FLINK-10429 Redesign Flink Scheduling, introducing dedicated Scheduler component
  3. FLINK-14232

Support global failure handling for DefaultScheduler (SchedulerNG)

Details

    Description

      Global failure handling (a full restart) is widely used in ExecutionGraph components, and even in other components, to recover the job from an inconsistent state.

      DefaultScheduler needs to support it as well so that this safety net is not broken. For more details, see here.

      Follow-up tasks may replace usages of full restarts with JVM termination in cases that are considered bugs or that are not expected to happen.

      Implementation plan:
      1. Add getGlobalFailureHandlingResult(Throwable) to ExecutionFailureHandler
      2. Add a method handleGlobalFailure(Throwable) to the SchedulerNG interface and implement it in DefaultScheduler
      3. Add a method notifyGlobalFailure(Throwable) to the InternalTaskFailuresListener interface and rework the implementations to use SchedulerNG#handleGlobalFailure
      4. Rework ExecutionGraph#failGlobal to invoke InternalTaskFailuresListener#notifyGlobalFailure when the NG scheduler is used
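
      The call chain described by the four steps can be sketched roughly as below. This is only an illustration, not the actual Flink code: the classes are simplified stand-ins that borrow the names from the plan, and the FailureHandlingResult fields, the fixed restart limit, and the recorded action strings are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the classes named in the plan. None of these
// signatures are taken from Flink itself; they only mirror the call chain.
final class GlobalFailureSketch {

    /** Hypothetical result type: says whether a full restart is still allowed. */
    static final class FailureHandlingResult {
        final boolean canRestart;
        final Throwable error;

        FailureHandlingResult(boolean canRestart, Throwable error) {
            this.canRestart = canRestart;
            this.error = error;
        }
    }

    /** Step 1: decide how to react to a global failure (assumed fixed restart limit). */
    static final class ExecutionFailureHandler {
        private final int maxRestarts;
        private int restarts;

        ExecutionFailureHandler(int maxRestarts) {
            this.maxRestarts = maxRestarts;
        }

        FailureHandlingResult getGlobalFailureHandlingResult(Throwable cause) {
            if (restarts < maxRestarts) {
                restarts++;
                return new FailureHandlingResult(true, cause);
            }
            return new FailureHandlingResult(false, cause);
        }
    }

    /** Step 2: the scheduler exposes an entry point for global failures. */
    interface SchedulerNG {
        void handleGlobalFailure(Throwable cause);
    }

    static final class DefaultScheduler implements SchedulerNG {
        private final ExecutionFailureHandler failureHandler;
        // Records what the scheduler decided, in place of real restart logic.
        final List<String> actions = new ArrayList<>();

        DefaultScheduler(ExecutionFailureHandler failureHandler) {
            this.failureHandler = failureHandler;
        }

        @Override
        public void handleGlobalFailure(Throwable cause) {
            FailureHandlingResult result = failureHandler.getGlobalFailureHandlingResult(cause);
            if (result.canRestart) {
                actions.add("restart all tasks");
            } else {
                actions.add("fail job: " + result.error.getMessage());
            }
        }
    }

    /** Step 3: the listener forwards global failures to the scheduler. */
    interface InternalTaskFailuresListener {
        void notifyGlobalFailure(Throwable cause);
    }

    /** Step 4: failGlobal delegates to the listener instead of restarting on its own. */
    static final class ExecutionGraph {
        private final InternalTaskFailuresListener listener;

        ExecutionGraph(InternalTaskFailuresListener listener) {
            this.listener = listener;
        }

        void failGlobal(Throwable cause) {
            listener.notifyGlobalFailure(cause);
        }
    }

    public static void main(String[] args) {
        DefaultScheduler scheduler = new DefaultScheduler(new ExecutionFailureHandler(1));
        // The listener implementation simply uses SchedulerNG#handleGlobalFailure.
        ExecutionGraph graph = new ExecutionGraph(scheduler::handleGlobalFailure);

        graph.failGlobal(new IllegalStateException("inconsistent state"));
        graph.failGlobal(new IllegalStateException("inconsistent state again"));

        System.out.println(scheduler.actions);
    }
}
```

      In this sketch ExecutionGraph no longer decides about the restart itself; it only reports the failure, and the scheduler, backed by the failure handler, decides between a full restart and failing the job.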

            People

              zhuzh Zhu Zhu
              zhuzh Zhu Zhu
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m