[YARN-5677] RM should transition to standby when connection is lost for an extended period - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.8.0
Fix Version/s: 2.8.0, 3.0.0-alpha2
Component/s: resourcemanager
Labels: None
Hadoop Flags: Reviewed

Description

In trunk, I see no maximum number of retries when the RM loses its connection to ZK. It appears the connection will be retried forever, with the active RM never figuring out it's no longer active. In my testing, the active-active state lasted almost 2 hours with no sign of stopping before I killed it. The solution appears to be to cap the number of retries or the amount of time spent retrying.

This issue is significant because of the asynchronous nature of job submission. If the active doesn't know it's not active, it will buffer up job submissions until it finally realizes it has become the standby. Then it will fail all the buffered job submissions in bulk. In high-volume workflows, that behavior can cause mass job failures.

This issue is also important because the node managers will not fail over to the new active until the old active realizes it's the standby. Workloads submitted after the old active loses contact with ZK will therefore fail to be executed regardless of which RM the clients contact.
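
The cap proposed above can be illustrated with a minimal sketch, written in Python for brevity rather than in the RM's Java. It is not the content of the attached patches. The names `retry_or_stand_down`, `connect`, and `become_standby` are hypothetical and do not correspond to any YARN or ZooKeeper API; the sketch only shows a retry loop bounded by both an attempt count and a time limit, after which the caller is told to stand down.

```python
import time
from typing import Callable


def retry_or_stand_down(
    connect: Callable[[], bool],          # hypothetical: returns True once reconnected
    become_standby: Callable[[], None],   # hypothetical: invoked when we give up
    max_retries: int = 5,
    max_seconds: float = 10.0,
    delay_seconds: float = 0.0,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry a lost connection, bounded by attempt count and elapsed time.

    Returns True if the connection was re-established. Otherwise calls
    become_standby() exactly once and returns False.
    """
    deadline = clock() + max_seconds
    for _attempt in range(max_retries):
        if connect():
            return True
        # Stop early once the time budget is spent, even if attempts remain.
        if clock() >= deadline:
            break
        time.sleep(delay_seconds)
    # Either cap was hit: stop claiming to be active.
    become_standby()
    return False
```

With both caps in place, an active instance that cannot reconnect gives up after a bounded interval instead of retrying forever. Injecting `clock` is only a convenience for exercising the time limit without real waiting.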

Attachments

YARN-5677.branch-2.001.patch, 17/Oct/16 19:54, 12 kB, Daniel Templeton
YARN-5677.005.patch, 06/Oct/16 22:41, 12 kB, Daniel Templeton
YARN-5677.004.patch, 06/Oct/16 19:52, 12 kB, Daniel Templeton
YARN-5677.003.patch, 06/Oct/16 02:19, 10 kB, Daniel Templeton
YARN-5677.002.patch, 06/Oct/16 00:25, 4 kB, Daniel Templeton
YARN-5677.001.patch, 29/Sep/16 03:17, 4 kB, Daniel Templeton

People

Assignee: Daniel Templeton
Reporter: Daniel Templeton
Votes: 0
Watchers: 13

Dates

Created: 27/Sep/16 00:47
Updated: 06/Jan/17 11:00
Resolved: 25/Oct/16 23:17
