Hadoop YARN / YARN-6483

Add nodes transitioning to DECOMMISSIONING state to the list of updated nodes returned to the AM


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1.0, 3.0.1
    • Component/s: resourcemanager
    • Labels: None

    Description

      The DECOMMISSIONING node state is currently used as part of the graceful decommissioning mechanism. It gives tasks running on a node that is scheduled for decommission time to complete, and gives reducer tasks time to read the shuffle blocks stored on that node. YARN also effectively blacklists nodes in the DECOMMISSIONING state by assigning them a capacity of 0, which prevents additional containers from being launched on those nodes, so no more shuffle blocks are written to them. This blacklisting is not effective for applications like Spark, because a Spark executor running in a YARN container keeps receiving tasks after the corresponding node has been blacklisted at the YARN level. We would like to propose a modification of the YARN heartbeat mechanism so that nodes transitioning to DECOMMISSIONING are added to the list of updated nodes returned by the Resource Manager in response to the Application Master heartbeat. This way a Spark application master would be able to blacklist a DECOMMISSIONING node at the Spark level.
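
      To illustrate the intended use on the Application Master side, here is a minimal sketch of how an AM might pick DECOMMISSIONING nodes out of the updated-nodes list it receives on a heartbeat. It is not taken from any of the attached patches. The types `NodeState` and `NodeUpdate` below are hypothetical stand-ins so the example compiles without Hadoop on the classpath; in a real AM the list would presumably come from `AllocateResponse.getUpdatedNodes()` and the state from `NodeReport.getNodeState()`. The class and method names `DecommissioningFilter` and `hostsToBlacklist` are also made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class DecommissioningFilter {

    // Hypothetical stand-in for YARN's NodeState enum (only a subset of values).
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED, LOST }

    // Hypothetical stand-in for a YARN NodeReport: just the host and its state.
    static final class NodeUpdate {
        final String host;
        final NodeState state;

        NodeUpdate(String host, NodeState state) {
            this.host = host;
            this.state = state;
        }
    }

    // Returns the hosts of all updated nodes that are currently DECOMMISSIONING,
    // in the order they appear in the heartbeat response. An application such as
    // Spark could stop scheduling new tasks on executors running on these hosts.
    static List<String> hostsToBlacklist(List<NodeUpdate> updatedNodes) {
        List<String> hosts = new ArrayList<>();
        for (NodeUpdate node : updatedNodes) {
            if (node.state == NodeState.DECOMMISSIONING) {
                hosts.add(node.host);
            }
        }
        return hosts;
    }
}
```

      An AM would call `hostsToBlacklist` once per heartbeat response and feed the result into its own application-level blacklist. How Spark would consume that list is outside the scope of this sketch.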

      Attachments

        1. YARN-6483.branch-3.0.addendum.patch
          1 kB
          Arun Suresh
        2. YARN-6483.003.patch
          68 kB
          Juan Rodríguez Hortalá
        3. YARN-6483.002.patch
          48 kB
          Juan Rodríguez Hortalá
        4. YARN-6483-v1.patch
          4 kB
          Juan Rodríguez Hortalá

            People

              Assignee: Juan Rodríguez Hortalá (juanrh)
              Reporter: Juan Rodríguez Hortalá (juanrh)
              Votes: 1
              Watchers: 16
