FLINK-14010

Dispatcher & JobManagers don't give up leadership when AM is shut down


Details

    Description

      In a YARN deployment, the YARN RM may launch a new AM for the job even if the previous AM has not terminated, for example when the AMRM heartbeat times out. In this common case, the RM sends a shutdown request to the previous AM and expects it to shut down properly.

      However, YARNResourceManager currently handles this request in onShutdownRequest, which simply closes the YARNResourceManager but not the Dispatcher and JobManagers. As a result, the Dispatcher and JobManagers launched in the new AM cannot be granted leadership properly. Visually,

      on previous AM: Dispatcher leader, JM leaders
      on new AM: ResourceManager leader

      Because the client side, or per-job mode, configures the JobManager address and port to point at the new AM, the whole cluster ends up in an unrecoverable, inconsistent state: all client queries go to the Dispatcher on the new AM, which is not the leader. In short, the Dispatcher and JobManagers on the previous AM do not give up their leadership properly.
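
The leadership split described above can be reproduced with a small, self-contained sketch. It does not use Flink or YARN classes: `ApplicationMaster`, `Component`, `LeaderContender`, `onShutdownRequestRmOnly` and `onShutdownRequestAll` are hypothetical names introduced only for illustration. It contrasts a shutdown handler that closes only the resource manager with one that makes every leader-contending component give up leadership, which is one possible direction and not necessarily the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the problem; none of these types are Flink or YARN classes.
public class ShutdownPropagationSketch {

    /** A component that holds leadership and can give it up. */
    interface LeaderContender {
        String name();
        boolean isLeader();
        void revokeLeadership();
    }

    static final class Component implements LeaderContender {
        private final String name;
        private boolean leader = true; // assume the previous AM holds all leaderships

        Component(String name) { this.name = name; }

        public String name() { return name; }
        public boolean isLeader() { return leader; }
        public void revokeLeadership() { leader = false; }
    }

    /** Stand-in for the components running inside one application master. */
    static final class ApplicationMaster {
        final Component resourceManager = new Component("ResourceManager");
        final Component dispatcher = new Component("Dispatcher");
        final Component jobManager = new Component("JobManager");

        List<LeaderContender> contenders() {
            return List.of(resourceManager, dispatcher, jobManager);
        }

        // Behaviour described in the issue: only the resource manager is closed.
        void onShutdownRequestRmOnly() {
            resourceManager.revokeLeadership();
        }

        // Assumed alternative: every component gives up leadership.
        void onShutdownRequestAll() {
            for (LeaderContender c : contenders()) {
                c.revokeLeadership();
            }
        }

        // Names of the components that still hold leadership on this AM.
        List<String> remainingLeaders() {
            List<String> names = new ArrayList<>();
            for (LeaderContender c : contenders()) {
                if (c.isLeader()) {
                    names.add(c.name());
                }
            }
            return names;
        }
    }

    public static void main(String[] args) {
        ApplicationMaster previous = new ApplicationMaster();
        previous.onShutdownRequestRmOnly();
        System.out.println("RM-only shutdown leaves leaders: " + previous.remainingLeaders());

        ApplicationMaster propagated = new ApplicationMaster();
        propagated.onShutdownRequestAll();
        System.out.println("Propagated shutdown leaves leaders: " + propagated.remainingLeaders());
    }
}
```

In the RM-only case the previous AM still reports the Dispatcher and JobManager as leaders, which matches the split shown above; in the propagated case nothing on the previous AM holds leadership, so components on the new AM would be free to acquire it.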

            People

              tison Zili Chen
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 20m