[FLINK-14010] Dispatcher & JobManagers don't give up leadership when AM is shut down - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.7.2, 1.8.2, 1.9.0, 1.10.0
Fix Version/s: 1.8.3, 1.9.1, 1.10.0
Component/s: Deployment / YARN, Runtime / Coordination
Labels:
- pull-request-available

Description

In a YARN deployment, the YARN RM may launch a new AM for the job even if the previous AM has not terminated, for example when the AMRM heartbeat times out. In this common case, the RM sends a shutdown request to the previous AM and expects it to shut down properly.

However, YARNResourceManager currently handles this request in onShutdownRequest, which simply closes the YARNResourceManager but not the Dispatcher and JobManagers. As a result, the Dispatcher and JobManagers launched in the new AM cannot be granted leadership. Schematically:

on previous AM: Dispatcher leader, JM leaders
on new AM: ResourceManager leader

Since, on the client side or in per-job mode, the JobManager address and port are configured to point to the new AM, the whole cluster ends up in an unrecoverable, inconsistent state: all clients query the Dispatcher on the new AM, which is not the leader. In short, the Dispatcher and JobManagers on the previous AM do not give up their leadership properly.
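The split-leadership state described above can be reproduced with a small toy model. This is only a sketch and not Flink code: `LeaderRegistry`, `simulate`, and the component names used as keys are hypothetical stand-ins for the real leader election services, and the model assumes leadership is held until it is explicitly released. It compares a shutdown handler that stops only the ResourceManager with one that stops every component in the old AM.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of the leadership state described in this issue; not Flink code. */
public class AmShutdownSketch {

    /** Hypothetical stand-in for a leader election service: one leader per component. */
    static final class LeaderRegistry {
        private final Map<String, String> leaders = new LinkedHashMap<>();

        /** Grants leadership only if no other contender currently holds it. */
        boolean tryAcquire(String component, String am) {
            return leaders.putIfAbsent(component, am) == null;
        }

        /** Releases leadership, but only if the given AM is the current holder. */
        void release(String component, String am) {
            leaders.remove(component, am);
        }

        String leaderOf(String component) {
            return leaders.get(component);
        }
    }

    static final String[] COMPONENTS = {"ResourceManager", "Dispatcher", "JobManager"};

    /**
     * Simulates an AM failover and returns the resulting leader of each component.
     *
     * @param stopAllComponents if false, the shutdown request closes only the
     *     ResourceManager (the behavior reported here); if true, every component
     *     in the previous AM gives up leadership.
     */
    static Map<String, String> simulate(boolean stopAllComponents) {
        LeaderRegistry registry = new LeaderRegistry();

        // The previous AM starts first and wins every election.
        for (String component : COMPONENTS) {
            registry.tryAcquire(component, "previous AM");
        }

        // The RM sends a shutdown request to the previous AM.
        if (stopAllComponents) {
            for (String component : COMPONENTS) {
                registry.release(component, "previous AM");
            }
        } else {
            registry.release("ResourceManager", "previous AM");
        }

        // The new AM starts and contends for every leadership.
        for (String component : COMPONENTS) {
            registry.tryAcquire(component, "new AM");
        }

        Map<String, String> result = new LinkedHashMap<>();
        for (String component : COMPONENTS) {
            result.put(component, registry.leaderOf(component));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println("RM-only shutdown: " + simulate(false));
        System.out.println("Full shutdown:    " + simulate(true));
    }
}
```

In this model, the RM-only shutdown leaves the Dispatcher and JobManager leadership with the previous AM while the new AM holds only the ResourceManager leadership, which matches the state listed above. Releasing every component lets the new AM take over all three.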

Issue Links

causes: FLINK-14347 YARNSessionFIFOITCase.checkForProhibitedLogContents found a log with prohibited string (Resolved)

links to: GitHub Pull Request #9719

People

Assignee: Zili Chen

Reporter: Zili Chen

Votes: 0

Watchers: 6

Dates

Created: 09/Sep/19 08:14

Updated: 10/Oct/19 16:06

Resolved: 24/Sep/19 17:24

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 20m