[YARN-6667] Handle containerId duplicate without failing the heartbeat in Federation Interceptor - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.4.0
Fix Version/s: 3.4.0
Component/s: federation, router
Labels: pull-request-available
Target Version/s: 3.4.0
Hadoop Flags: Reviewed

Description

In practice, the probability of a duplicate containerId is very low.
It can only be caused by a YARN master-slave failover and a wrong Epoch parameter configuration.

We try to tolerate this situation and keep the Application running as far as possible, using the following measures:
1. Select the RM whose heartbeat has not timed out for allocation, and also require it to be in the RUNNING state.
2. If neither RM's heartbeat has timed out and both are in the RUNNING state, select the previously allocated RM for Container processing.
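
The two measures above can be sketched as a small selection routine. This is only an illustration under stated assumptions, not the code from the pull request: the class `DuplicateContainerResolver`, the types `RmInfo` and `State`, and the parameters `nowMs` and `timeoutMs` are hypothetical names introduced here. The description does not say what happens when neither RM qualifies, so the sketch returns an empty result in that case and leaves the decision to the caller.

```java
import java.util.Optional;

/** Illustrative sketch only; all names here are hypothetical, not Hadoop APIs. */
public class DuplicateContainerResolver {

  /** Hypothetical RM state; only RUNNING matters for the selection. */
  enum State { RUNNING, NOT_RUNNING }

  /** Hypothetical snapshot of one RM that reported the containerId. */
  static final class RmInfo {
    final String id;
    final State state;
    final long lastHeartbeatMs;

    RmInfo(String id, State state, long lastHeartbeatMs) {
      this.id = id;
      this.state = state;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  /** Measure 1: heartbeat has not timed out and the RM is RUNNING. */
  static boolean isUsable(RmInfo rm, long nowMs, long timeoutMs) {
    return rm.state == State.RUNNING && nowMs - rm.lastHeartbeatMs <= timeoutMs;
  }

  /**
   * Picks the RM that should keep handling a duplicated containerId.
   *
   * @param previous  the RM the container was previously allocated from
   * @param candidate the other RM now reporting the same containerId
   */
  static Optional<RmInfo> chooseOwner(RmInfo previous, RmInfo candidate,
                                      long nowMs, long timeoutMs) {
    // Measure 2: if both are usable, the previous RM wins. The same branch
    // also covers the case where only the previous RM is usable.
    if (isUsable(previous, nowMs, timeoutMs)) {
      return Optional.of(previous);
    }
    // Measure 1: otherwise fall back to the candidate if it is usable.
    if (isUsable(candidate, nowMs, timeoutMs)) {
      return Optional.of(candidate);
    }
    // Assumption: neither RM qualifies, so the caller decides what to do.
    return Optional.empty();
  }
}
```

Checking the previous RM first keeps the container with its existing owner whenever that owner is still healthy, which matches the stated goal of not failing the heartbeat.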

Issue Links

links to

GitHub Pull Request #4810

People

Assignee: Shilun Fan

Reporter: Botong Huang

Votes: 0

Watchers: 3

Dates

Created: 30/May/17 21:00

Updated: 12/Feb/24 06:46

Resolved: 02/Sep/22 17:25