Details
- Type: Bug
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
- 1.7.2, 1.8.2, 1.9.0, 1.10.0
Description
In a YARN deployment, the YARN RM may launch a new AM for the job even if the previous AM has not terminated, for example when the AMRM heartbeat times out. In this common case, the RM sends a shutdown request to the previous AM and expects that AM to shut down properly.
However, YARNResourceManager currently handles this request in onShutdownRequest, which simply closes the YARNResourceManager but not the Dispatcher and JobManagers. As a result, the Dispatcher and JobManagers launched in the new AM cannot be granted leadership properly. Visually:
on previous AM: Dispatcher leader, JM leaders
on new AM: ResourceManager leader
Since on the client side and in per-job mode the JobManager address and port are configured to point at the new AM, the whole cluster ends up in an unrecoverable, inconsistent state: all clients query the Dispatcher on the new AM, which is not the leader. In short, the Dispatcher and JobManagers on the previous AM do not give up their leadership properly.
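To make the failure mode concrete, here is a minimal toy model in plain Java. It is not Flink code: `Component`, `ApplicationMaster`, and both `onShutdownRequest...` methods are hypothetical names invented for illustration, and "closing" a component is assumed to be what releases its leadership. The second handler shows one possible remedy (shutting down everything in the process), which is an assumption and not necessarily the fix that was merged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not Flink code) of leadership held by components inside one AM process. */
public class AmShutdownSketch {

    /** Hypothetical stand-in for a leader-contending component. */
    static final class Component {
        final String name;
        boolean leader;

        Component(String name) {
            this.name = name;
            this.leader = true; // the previous AM starts out holding all leaderships
        }

        // Assumption: closing a component is what revokes its leadership.
        void close() {
            leader = false;
        }
    }

    /** Hypothetical stand-in for the process hosting the RM, Dispatcher, and a JobManager. */
    static final class ApplicationMaster {
        final Component resourceManager = new Component("ResourceManager");
        final Component dispatcher = new Component("Dispatcher");
        final Component jobManager = new Component("JobManager");

        List<Component> all() {
            return Arrays.asList(resourceManager, dispatcher, jobManager);
        }

        // Behaviour described in this issue: only the resource manager is closed.
        void onShutdownRequestResourceManagerOnly() {
            resourceManager.close();
        }

        // One possible remedy (an assumption): close every component in the process.
        void onShutdownRequestWholeProcess() {
            for (Component c : all()) {
                c.close();
            }
        }

        // Names of the components that still hold leadership.
        List<String> heldLeaderships() {
            List<String> held = new ArrayList<>();
            for (Component c : all()) {
                if (c.leader) {
                    held.add(c.name);
                }
            }
            return held;
        }
    }

    public static void main(String[] args) {
        ApplicationMaster previousAm = new ApplicationMaster();
        previousAm.onShutdownRequestResourceManagerOnly();
        System.out.println("RM-only shutdown still holds: " + previousAm.heldLeaderships());

        ApplicationMaster fixedAm = new ApplicationMaster();
        fixedAm.onShutdownRequestWholeProcess();
        System.out.println("Whole-process shutdown still holds: " + fixedAm.heldLeaderships());
    }
}
```

With the RM-only handler, the previous AM keeps the Dispatcher and JobManager leaderships, which matches the split shown above. Closing every component leaves nothing held, so the components on the new AM could be elected.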
Issue Links
- causes: FLINK-14347 YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string (Resolved)
- links to