Hadoop YARN / YARN-914 (Umbrella) Support graceful decommission of nodemanager / YARN-5465

Server-Side NM Graceful Decommissioning subsequent call behavior


    Details

    • Type: Sub-task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: graceful
    • Labels:
      None
    • Target Version/s:

      Description

      The Server-Side NM Graceful Decommissioning feature added by YARN-4676 behaves as follows when a subsequent graceful decommission call is made while other nodes are still decommissioning:

      1. Start a long-running job that has containers running on nodeA
      2. Add nodeA to the exclude file
      3. Run -refreshNodes -g 120 -server (2-minute timeout) to begin gracefully decommissioning nodeA
      4. Wait 30 seconds
      5. Add nodeB to the exclude file
      6. Run -refreshNodes -g 30 -server (30-second timeout) to begin gracefully decommissioning nodeB
      7. After 30 seconds, both nodeA and nodeB shut down

      In short, a subsequent graceful decommission call applies its timeout, counted from the time of that call, to every node that is already decommissioning. This makes it impossible to gracefully decommission different sets of nodes with different timeouts, although it does make it easy to update the timeout of nodes that are currently decommissioning.
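      To make the current semantics concrete, here is a minimal Java sketch that reproduces the timing in the steps above. It assumes each graceful refresh re-reads the whole exclude file and that time can be simulated in whole seconds. SharedTimeoutModel, refreshGraceful and deadlineOf are hypothetical names chosen for illustration, not ResourceManager classes, and the sketch models only the observable behavior, not how YARN-4676 implements it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical model of the current behavior: every graceful refresh call
 * resets the deadline of ALL decommissioning nodes to now + timeout.
 * Time is simulated in whole seconds; this is not ResourceManager code.
 */
public class SharedTimeoutModel {
    // node name -> simulated time (seconds) at which the node is shut down
    private final Map<String, Long> deadlines = new LinkedHashMap<>();

    /** Models "-refreshNodes -g timeoutSec -server" issued at time now. */
    public void refreshGraceful(long now, long timeoutSec, List<String> excludeFile) {
        // Nodes no longer in the exclude file are treated as recommissioned.
        deadlines.keySet().retainAll(excludeFile);
        for (String node : excludeFile) {
            // Unconditional put: the new timeout overrides any pending deadline.
            deadlines.put(node, now + timeoutSec);
        }
    }

    /** Returns the shutdown time, or -1 if the node is not decommissioning. */
    public long deadlineOf(String node) {
        return deadlines.getOrDefault(node, -1L);
    }

    public static void main(String[] args) {
        SharedTimeoutModel rm = new SharedTimeoutModel();
        rm.refreshGraceful(0, 120, List.of("nodeA"));          // step 3
        rm.refreshGraceful(30, 30, List.of("nodeA", "nodeB")); // step 6
        System.out.println("nodeA: " + rm.deadlineOf("nodeA")); // nodeA: 60
        System.out.println("nodeB: " + rm.deadlineOf("nodeB")); // nodeB: 60
    }
}
```

      Both nodes end up with the same deadline (t = 60 s), matching step 7.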

      An alternative behavior would be:

      1. Start a long-running job that has containers running on nodeA
      2. Add nodeA to the exclude file
      3. Run -refreshNodes -g 120 -server (2-minute timeout) to begin gracefully decommissioning nodeA
      4. Wait 30 seconds
      5. Add nodeB to the exclude file
      6. Run -refreshNodes -g 30 -server (30-second timeout) to begin gracefully decommissioning nodeB
      7. After 30 seconds, nodeB shuts down
      8. After 60 more seconds (2 minutes after step 3), nodeA shuts down

      This keeps the nodes affected by each graceful decommission call independent of one another, so different sets of nodes can be decommissioning with different timeouts at the same time. However, to change the timeout of a node that is already decommissioning, you would first have to recommission it and then decommission it again.
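      The alternative can be sketched in Java under the same assumptions: each graceful refresh re-reads the whole exclude file, and time is simulated in whole seconds. PerCallDeadlineModel and its methods are hypothetical names for illustration only, not a proposed ResourceManager API. Each node keeps the deadline set by the call that first started decommissioning it, and removing a node from the exclude file (recommissioning it) is the only way to clear that deadline.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical model of the alternative behavior: each node keeps the
 * deadline assigned by the call that first started decommissioning it.
 * Time is simulated in whole seconds; this is not ResourceManager code.
 */
public class PerCallDeadlineModel {
    // node name -> simulated time (seconds) at which the node is shut down
    private final Map<String, Long> deadlines = new HashMap<>();

    /** Models "-refreshNodes -g timeoutSec -server" issued at time now. */
    public void refreshGraceful(long now, long timeoutSec, List<String> excludeFile) {
        // Nodes removed from the exclude file are recommissioned: deadline dropped.
        deadlines.keySet().retainAll(excludeFile);
        for (String node : excludeFile) {
            // Only nodes not already decommissioning receive this call's timeout.
            deadlines.putIfAbsent(node, now + timeoutSec);
        }
    }

    /** Returns the shutdown time, or -1 if the node is not decommissioning. */
    public long deadlineOf(String node) {
        return deadlines.getOrDefault(node, -1L);
    }

    public static void main(String[] args) {
        PerCallDeadlineModel rm = new PerCallDeadlineModel();
        rm.refreshGraceful(0, 120, List.of("nodeA"));          // step 3
        rm.refreshGraceful(30, 30, List.of("nodeA", "nodeB")); // step 6
        System.out.println("nodeA: " + rm.deadlineOf("nodeA")); // nodeA: 120
        System.out.println("nodeB: " + rm.deadlineOf("nodeB")); // nodeB: 60

        // Changing nodeA's timeout: recommission it, then decommission it again.
        rm.refreshGraceful(40, 30, List.of("nodeB"));          // nodeA recommissioned
        rm.refreshGraceful(40, 10, List.of("nodeA", "nodeB")); // nodeA gets a new deadline
        System.out.println("nodeA: " + rm.deadlineOf("nodeA")); // nodeA: 50
    }
}
```

      Using putIfAbsent rather than an unconditional put is what prevents a later call from overwriting a deadline that is already pending; the last two calls in main show the recommission-then-decommission workaround described above.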


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Kanter (rkanter)
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated: