Hadoop YARN / YARN-8345

NodeHealthCheckerService to differentiate between reason for UnusableNodes for client to act suitably on it


Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

    Description

      Current scenario:

      NodeHealthCheckerService marks a node unhealthy on the basis of two things:

      1. External script
      2. Directory status

      If a directory is marked as full (as per the disk check configs in yarn-site), the NodeManager marks the node as unhealthy.

      Once a node is marked unhealthy, MapReduce relaunches all the map tasks that ran on this now-unusable node. This leads to even successful tasks being relaunched.

      Problem:

      There is no distinction between the disk limit at which container launches should stop on a node and the limit up to which reducers can still read data from that node.

      For example:

      Consider a 3 TB disk. Suppose we set the maximum disk utilisation percentage to 95% (launching a container requires approximately 0.15 TB for jobs in our cluster), and a few nodes have a disk utilisation of, say, 96%. The threshold is breached, so the NodeManager marks these nodes unhealthy, and all successful mappers that ran on them are relaunched on other nodes. Yet the remaining 4% of free disk space is still good enough for reducers to read that data. This causes unnecessary delay in our jobs. (Relaunched mappers can also preempt reducers if there is a crunch for space, and there are issues with calculating headroom in the Capacity Scheduler as well.)
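
      The arithmetic in this example can be checked with a small sketch. The class and method names below are hypothetical and are not part of YARN; the sketch assumes 1 TB = 1000 GB for simplicity and only illustrates how much space is left at each utilisation level.

```java
public class DiskThresholdExample {

    // Free space in GB left on a disk of the given capacity at the given utilisation.
    public static long freeGb(long capacityGb, long usedPercent) {
        return capacityGb * (100 - usedPercent) / 100;
    }

    // True when utilisation exceeds the configured maximum, i.e. the disk counts as full.
    public static boolean breachesThreshold(long usedPercent, long maxPercent) {
        return usedPercent > maxPercent;
    }

    public static void main(String[] args) {
        long capacityGb = 3000; // 3 TB disk, assuming 1 TB = 1000 GB
        long maxPercent = 95;   // maximum disk utilisation from the example
        long usedPercent = 96;  // actual utilisation on the affected nodes

        System.out.println("Free at threshold: " + freeGb(capacityGb, maxPercent) + " GB");
        System.out.println("Free at 96% used:  " + freeGb(capacityGb, usedPercent) + " GB");
        System.out.println("Threshold breached: " + breachesThreshold(usedPercent, maxPercent));
    }
}
```

      At 96% utilisation the disk is already past the 95% limit, so it is treated as full even though it still has free space and its existing map output remains readable.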

       

      Correction:

      We need a state (say, UNUSABLE_WRITE) that lets MapReduce know that the node is still good for reading data and that successful mappers should not be relaunched. This would prevent the delay.
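
      One way the proposed state could look is sketched below. Only the name UNUSABLE_WRITE comes from this issue; the enum, the two separate limits (launchLimitPct and readLimitPct), and the helper methods are assumptions made for illustration and do not reflect any existing YARN or MapReduce API.

```java
public class NodeUsabilitySketch {

    // Hypothetical three-way state; UNUSABLE_WRITE is the state proposed in this issue.
    public enum NodeUsability { HEALTHY, UNUSABLE_WRITE, UNHEALTHY }

    // Classify a node from the external script result and its disk utilisation.
    // launchLimitPct: above this, no new containers should be launched (assumed).
    // readLimitPct:   above this, the node is not trusted even for reads (assumed).
    public static NodeUsability classify(boolean scriptHealthy, double diskUsedPct,
                                         double launchLimitPct, double readLimitPct) {
        if (!scriptHealthy || diskUsedPct > readLimitPct) {
            return NodeUsability.UNHEALTHY;
        }
        if (diskUsedPct > launchLimitPct) {
            return NodeUsability.UNUSABLE_WRITE;
        }
        return NodeUsability.HEALTHY;
    }

    // New containers are only placed on fully healthy nodes.
    public static boolean canLaunchContainers(NodeUsability state) {
        return state == NodeUsability.HEALTHY;
    }

    // Completed map output is abandoned only when the node cannot serve reads.
    public static boolean shouldRerunCompletedMaps(NodeUsability state) {
        return state == NodeUsability.UNHEALTHY;
    }

    public static void main(String[] args) {
        // The 96% node from the example, with an assumed read limit of 99%.
        NodeUsability state = classify(true, 96.0, 95.0, 99.0);
        System.out.println(state + " launch=" + canLaunchContainers(state)
                + " rerunMaps=" + shouldRerunCompletedMaps(state));
    }
}
```

      With these assumed limits, the 96% node from the example stops receiving new containers but keeps serving its completed map output, so successful mappers are not rerun.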

        

            People

              Assignee: Unassigned
              Reporter: Kartik Bhatia (kartik.bhatia)
              Votes: 0
              Watchers: 2
