Details
- Type: New Feature
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
Description
Current Scenario:
NodeHealthCheckerService marks a node unhealthy based on two checks:
- the external health-check script
- local directory status
If a directory is marked as full (per the disk-check configs in yarn-site.xml), the NodeManager marks the node unhealthy.
Once a node is marked unhealthy, MapReduce relaunches all the map tasks that ran on that node, so even successfully completed tasks are run again.
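For reference, the per-disk utilization cutoff behind this check is the NodeManager disk health checker setting in yarn-site.xml; a minimal example, raising it to the 95% figure used in the scenario below:

{code:xml}
<!-- yarn-site.xml: mark a local dir "full" once it passes 95% utilization -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>95.0</value>
</property>
{code}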
Problem:
There is no distinction between the disk-utilization limit at which container launches should stop on a node and the (higher) limit beyond which reducers can no longer read map output from that node.
For Example:
Consider a 3 TB disk with the max disk utilisation percentage set to 95%, since launching a container requires roughly 0.15 TB for jobs in our cluster. If a few nodes sit at, say, 96% utilisation, the threshold is breached and the NodeManager marks those nodes unhealthy, so all successful mappers are relaunched on other nodes. Yet the remaining 4% of the disk (about 0.12 TB) is still enough for reducers to read the map output already on those nodes. This causes unnecessary delay in our jobs. (Relaunched mappers can also preempt reducers when space is tight, and there are issues with headroom calculation in the CapacityScheduler as well.)
Correction:
We need a state (say UNUSABLE_WRITE) that lets MapReduce know the node is still good for reading data, so that successful mappers are not relaunched. This would prevent the delay; a sketch follows.
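The snippet below is a hypothetical sketch of the proposed distinction, not existing YARN code: the DiskState names, threshold constants, and classify method are all illustrative, assuming a write cutoff of 95% and a read cutoff of 99%.

{code:java}
// Hypothetical sketch: split the single unhealthy threshold into a write
// cutoff (stop launching containers) and a read cutoff (stop serving data).
public class DiskHealthSketch {

  enum DiskState { HEALTHY, UNUSABLE_WRITE, UNHEALTHY }

  // Illustrative values, not real YARN config keys.
  static final float WRITE_LIMIT_PCT = 95.0f; // above this: no new containers
  static final float READ_LIMIT_PCT  = 99.0f; // above this: fully unhealthy

  static DiskState classify(float utilizationPct) {
    if (utilizationPct > READ_LIMIT_PCT) {
      return DiskState.UNHEALTHY;      // completed maps must be relaunched
    }
    if (utilizationPct > WRITE_LIMIT_PCT) {
      return DiskState.UNUSABLE_WRITE; // map output stays readable by reducers
    }
    return DiskState.HEALTHY;
  }

  public static void main(String[] args) {
    // A node at 96% is UNHEALTHY today; under this proposal it would be
    // UNUSABLE_WRITE, so its finished map tasks are not relaunched.
    System.out.println(classify(96.0f)); // prints UNUSABLE_WRITE
  }
}
{code}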
Issue Links
- duplicates: YARN-1996 Provide alternative policies for UNHEALTHY nodes (Open)