FLINK-11843

Dispatcher fails to recover jobs if leader change happens during JobManagerRunner termination

Details

    Description

      The Dispatcher fails to recover jobs if a leader change happens while the JobManagerRunner of the previous run is still terminating. The problem is that we schedule the start future of the recovered JobGraph using the MainThreadExecutor and additionally require that this future is completed before any recovery operation from a subsequent leadership session is executed. If the leadership changes in the meantime, the MainThreadExecutor is invalidated and the scheduled future never completes, so the recovery operations of the next leadership session never run.

      The relevant mailing list thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/1-7-1-job-stuck-in-suspended-state-td26439.html
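
The failure mode described above can be reproduced in miniature with plain `java.util.concurrent` types. The following is only a sketch under stated assumptions, not Flink's actual code: `SessionExecutor` is a hypothetical stand-in for a main-thread executor tied to one leadership session, and `startFutureCompletes` is a hypothetical helper that chains a "start" future onto a "termination" future through that executor. Once the executor is invalidated, it silently drops the scheduled task, so the dependent future stays incomplete.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

public class LeaderChangeSketch {

    /** Hypothetical stand-in for a main-thread executor bound to one leadership session. */
    static final class SessionExecutor implements Executor {
        private final Queue<Runnable> queue = new ArrayDeque<>();
        private boolean valid = true;

        @Override
        public void execute(Runnable task) {
            // Assumption: after invalidation, submitted tasks are dropped without notice.
            if (valid) {
                queue.add(task);
            }
        }

        /** Simulates a leadership change: the executor stops accepting and running tasks. */
        void invalidate() {
            valid = false;
            queue.clear();
        }

        /** Runs all queued tasks, as the main thread would during a valid session. */
        void drain() {
            Runnable task;
            while ((task = queue.poll()) != null) {
                task.run();
            }
        }
    }

    /** Returns whether the start future completes once the previous run has terminated. */
    static boolean startFutureCompletes(boolean leaderChangeDuringTermination) {
        SessionExecutor mainThreadExecutor = new SessionExecutor();
        CompletableFuture<Void> terminationFuture = new CompletableFuture<>();

        // The start of the recovered job is scheduled on the session-bound executor.
        CompletableFuture<Void> startFuture =
                terminationFuture.thenRunAsync(() -> { /* start recovered job */ }, mainThreadExecutor);

        if (leaderChangeDuringTermination) {
            mainThreadExecutor.invalidate();
        }

        // The previous run finishes terminating; the start task is now handed to the executor.
        terminationFuture.complete(null);
        mainThreadExecutor.drain();

        return startFuture.isDone();
    }

    public static void main(String[] args) {
        System.out.println("no leader change: " + startFutureCompletes(false));
        System.out.println("leader change:    " + startFutureCompletes(true));
    }
}
```

In this sketch, anything that waits on `startFuture` in the leader-change case waits forever, which mirrors why later recovery operations gated on that future cannot proceed. One possible way out, again only as an assumption, would be to complete or fail such futures explicitly when the session ends rather than relying on the invalidated executor.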

            People

              trohrmann Till Rohrmann
              Votes: 2
              Watchers: 14

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m