Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
The DECOMMISSIONING node state is currently used as part of the graceful decommissioning mechanism to give tasks time to complete on a node that is scheduled for decommission, and to give reducer tasks time to read the shuffle blocks on that node. YARN also effectively blacklists nodes in DECOMMISSIONING state by assigning them a capacity of 0, which prevents additional containers from being launched on those nodes, so no more shuffle blocks are written to them. This blacklisting is not effective for applications like Spark, because a Spark executor running in a YARN container will keep receiving tasks after the corresponding node has been blacklisted at the YARN level. We propose modifying the YARN heartbeat mechanism so that nodes transitioning to DECOMMISSIONING are added to the list of updated nodes returned by the Resource Manager in response to the Application Master heartbeat. This way, a Spark application master would be able to blacklist a DECOMMISSIONING node at the Spark level.
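To illustrate the intent, here is a minimal, self-contained sketch of how an application master might react if DECOMMISSIONING nodes were included in the updated-node list of the heartbeat response. It does not use the real YARN or Spark classes: `NodeState`, `NodeUpdate`, and `DecommissioningBlacklist` are hypothetical stand-ins defined only for this example, and removing a host from the blacklist when it reports RUNNING again is an assumption, not part of this proposal.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types are the real YARN or Spark classes.
class DecommissioningBlacklist {

    /** Stand-in for the node states relevant to this example. */
    public enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    /** Stand-in for one entry of the updated-node list in a heartbeat response. */
    public static final class NodeUpdate {
        public final String host;
        public final NodeState state;

        public NodeUpdate(String host, NodeState state) {
            this.host = host;
            this.state = state;
        }
    }

    // Hosts on which the application should stop scheduling new tasks.
    private final Set<String> blacklistedHosts = new HashSet<>();

    /** Processes the updated-node list received with one heartbeat response. */
    public void onHeartbeatResponse(List<NodeUpdate> updatedNodes) {
        for (NodeUpdate update : updatedNodes) {
            if (update.state == NodeState.DECOMMISSIONING
                    || update.state == NodeState.DECOMMISSIONED) {
                // Stop sending tasks here so no new shuffle blocks land on the node.
                blacklistedHosts.add(update.host);
            } else if (update.state == NodeState.RUNNING) {
                // Assumption: a node reported as RUNNING again may be used again.
                blacklistedHosts.remove(update.host);
            }
        }
    }

    /** Returns true if new tasks may still be scheduled on the given host. */
    public boolean canSchedule(String host) {
        return !blacklistedHosts.contains(host);
    }

    public static void main(String[] args) {
        DecommissioningBlacklist blacklist = new DecommissioningBlacklist();
        blacklist.onHeartbeatResponse(Arrays.asList(
                new NodeUpdate("host-1", NodeState.DECOMMISSIONING),
                new NodeUpdate("host-2", NodeState.RUNNING)));
        System.out.println("host-1 schedulable: " + blacklist.canSchedule("host-1"));
        System.out.println("host-2 schedulable: " + blacklist.canSchedule("host-2"));
    }
}
```

The point of the sketch is that the application-level scheduler, not only YARN, needs to learn about the state change: executors already running on the node keep their containers, so only the application can stop routing new tasks to them.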
Attachments
Issue Links
- duplicates
  - YARN-3224 Notify AM with containers (on decommissioning node) could be preempted after timeout. (Resolved)
- is related to
  - YARN-10538 Add recommissioning nodes to the list of updated nodes returned to the AM (Resolved)
  - YARN-11125 Backport YARN-6483 to branch-2.10 (Resolved)
- links to