  1. Flink
  2. FLINK-14010

Dispatcher & JobManagers don't give up leadership when AM is shut down

      Description

      In a YARN deployment, the YARN RM may launch a new AM for the job even though the previous AM has not terminated, for example when the AMRM heartbeat times out. This is a common case: the RM sends a shutdown request to the previous AM and expects that AM to shut down properly.

      However, YARNResourceManager currently handles this request in onShutdownRequest, which simply closes the YARNResourceManager but not the Dispatcher and JobManagers. As a result, the Dispatcher and JobManagers launched in the new AM cannot be granted leadership. Schematically:

      on previous AM: Dispatcher leader, JM leaders
      on new AM: ResourceManager leader

      Because the JobManager address and port are configured to point at the new AM, both on the client side and in per-job mode, the whole cluster ends up in an unrecoverable, inconsistent state: all client queries go to the Dispatcher on the new AM, which is not the leader. In short, the Dispatcher and JobManagers on the previous AM do not give up their leadership properly.
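
      The difference between the current and the expected shutdown behavior can be illustrated with a small toy model. This is only a sketch and not Flink code: the class AmShutdownSketch, its nested types, and both handler methods are hypothetical names invented for illustration, and leadership is reduced to a boolean flag rather than a real leader election service.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of an application master (AM) hosting leader-elected components. Not Flink code. */
public class AmShutdownSketch {

    /** A component that may hold leadership, e.g. a dispatcher or a job manager. */
    static final class Component {
        final String name;
        boolean leader;

        Component(String name, boolean leader) {
            this.name = name;
            this.leader = leader;
        }

        /** Stops the component, which also gives up its leadership. */
        void close() {
            leader = false;
        }
    }

    /** Hypothetical AM process holding a resource manager, a dispatcher and job managers. */
    static final class ApplicationMaster {
        final Component resourceManager = new Component("ResourceManager", true);
        final Component dispatcher = new Component("Dispatcher", true);
        final List<Component> jobManagers = new ArrayList<>();

        /** Behavior described in the issue: only the resource manager is closed. */
        void onShutdownRequestResourceManagerOnly() {
            resourceManager.close();
        }

        /** Behavior the issue asks for: every component gives up its leadership. */
        void onShutdownRequestWholeProcess() {
            resourceManager.close();
            dispatcher.close();
            jobManagers.forEach(Component::close);
        }

        /** Names of the components on this AM that still hold leadership. */
        List<String> leaders() {
            List<String> result = new ArrayList<>();
            if (resourceManager.leader) {
                result.add(resourceManager.name);
            }
            if (dispatcher.leader) {
                result.add(dispatcher.name);
            }
            for (Component jobManager : jobManagers) {
                if (jobManager.leader) {
                    result.add(jobManager.name);
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // Previous AM that only closes its resource manager on a shutdown request.
        ApplicationMaster previous = new ApplicationMaster();
        previous.jobManagers.add(new Component("JobManager-1", true));
        previous.onShutdownRequestResourceManagerOnly();
        System.out.println("RM-only shutdown, remaining leaders: " + previous.leaders());

        // Previous AM that shuts down every component on a shutdown request.
        ApplicationMaster fixed = new ApplicationMaster();
        fixed.jobManagers.add(new Component("JobManager-1", true));
        fixed.onShutdownRequestWholeProcess();
        System.out.println("Whole-process shutdown, remaining leaders: " + fixed.leaders());
    }
}
```

      In this model, the RM-only handler leaves the Dispatcher and the JobManager on the previous AM as leaders, which matches the split shown above, while the whole-process handler leaves no leaders behind, so components on the new AM could take over.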

              People

              • Assignee: Zili Chen (tison)
              • Reporter: Zili Chen (tison)

                  Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 20m