YARN-9011: Race condition during decommissioning

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.3.0, 3.2.2, 3.1.4
    • Component/s: nodemanager
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      During internal testing, we found a race condition that occurs while a NodeManager is being gracefully decommissioned.

      Node manager, incorrect behaviour:

      2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
      2018-06-18 21:00:17,634 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: Disallowed NodeManager nodeId: node-6.hostname.com:8041 hostname:node-6.hostname.com
      

      Node manager, expected behaviour:

      2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Received SHUTDOWN signal from Resourcemanager as part of heartbeat, hence shutting down.
      2018-06-18 21:07:37,377 WARN org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Message from ResourceManager: DECOMMISSIONING  node-6.hostname.com:8041 is ready to be decommissioned
      

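      Note the two different messages from the RM ("Disallowed NodeManager" vs. "DECOMMISSIONING"). The problem is that ResourceTrackerService can observe an inconsistent node state while refreshNodes is still being applied. In the ResourceManager log below, the exclude list already contains node-6.hostname.com at 21:00:17,575, but the node only transitions from RUNNING to DECOMMISSIONING at 21:00:17,577. The heartbeat handled in between is rejected as disallowed: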

      2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: hostsReader include:{172.26.12.198,node-7.hostname.com,node-2.hostname.com,node-5.hostname.com,172.26.8.205,node-8.hostname.com,172.26.23.76,172.26.22.223,node-6.hostname.com,172.26.9.218,node-4.hostname.com,node-3.hostname.com,172.26.13.167,node-9.hostname.com,172.26.21.221,172.26.10.219} exclude:{node-6.hostname.com}
      2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.NodesListManager: Gracefully decommission node node-6.hostname.com:8041 with state RUNNING
      2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService: Disallowed NodeManager nodeId: node-6.hostname.com:8041 node: node-6.hostname.com
      2018-06-18 21:00:17,576 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Put Node node-6.hostname.com:8041 in DECOMMISSIONING.
      2018-06-18 21:00:17,575 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=yarn     IP=172.26.22.115        OPERATION=refreshNodes  TARGET=AdminService     RESULT=SUCCESS
      2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Preserve original total capability: <memory:8192, vCores:8>
      2018-06-18 21:00:17,577 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: node-6.hostname.com:8041 Node Transitioned from RUNNING to DECOMMISSIONING
      

      When the decommissioning succeeds, there is no output logged from ResourceTrackerService.
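
      The ordering visible in the ResourceManager log can be modelled with a small deterministic sketch. This is a simplified illustration, not the actual YARN code: RmView, heartbeat, racyInterleaving and expectedInterleaving are hypothetical names, and the check is an assumption inferred from the two messages the NodeManager receives. It replays the two interleavings without threads so the outcome is reproducible.

```java
import java.util.HashSet;
import java.util.Set;

public class DecommissionRaceSketch {
    enum NodeState { RUNNING, DECOMMISSIONING }

    /** Hypothetical, simplified stand-in for the RM's view of a single node. */
    static final class RmView {
        final Set<String> excludedHosts = new HashSet<>();
        NodeState state = NodeState.RUNNING;
    }

    /** Assumed heartbeat check, modelled on the two messages seen in the logs. */
    static String heartbeat(RmView view, String host) {
        boolean excluded = view.excludedHosts.contains(host);
        if (excluded && view.state != NodeState.DECOMMISSIONING) {
            // Exclude list is updated, but the node state is not yet.
            return "Disallowed NodeManager";
        }
        if (view.state == NodeState.DECOMMISSIONING) {
            return "DECOMMISSIONING";
        }
        return "OK";
    }

    /** The heartbeat lands between the exclude-list refresh and the state transition. */
    static String racyInterleaving(String host) {
        RmView view = new RmView();
        view.excludedHosts.add(host);             // step 1: exclude list refreshed
        String response = heartbeat(view, host);  // step 2: heartbeat sees a half-updated view
        view.state = NodeState.DECOMMISSIONING;   // step 3: transition arrives too late
        return response;
    }

    /** The heartbeat is handled only after both updates are visible. */
    static String expectedInterleaving(String host) {
        RmView view = new RmView();
        view.excludedHosts.add(host);
        view.state = NodeState.DECOMMISSIONING;
        return heartbeat(view, host);
    }

    public static void main(String[] args) {
        String host = "node-6.hostname.com";
        System.out.println("racy:     " + racyInterleaving(host));
        System.out.println("expected: " + expectedInterleaving(host));
    }
}
```

      In this model the two updates are separate steps, so any reader that runs between them sees a state that never should have been visible. Making both changes visible to the heartbeat path at the same time would remove the window; whether the attached patches take that approach is not stated in the description.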

      Attachments

        1. YARN-9011-001.patch
          5 kB
          Peter Bacsko
        2. YARN-9011-002.patch
          38 kB
          Peter Bacsko
        3. YARN-9011-003.patch
          53 kB
          Peter Bacsko
        4. YARN-9011-004.patch
          53 kB
          Peter Bacsko
        5. YARN-9011-005.patch
          54 kB
          Peter Bacsko
        6. YARN-9011-006.patch
          9 kB
          Peter Bacsko
        7. YARN-9011-007.patch
          11 kB
          Peter Bacsko
        8. YARN-9011-008.patch
          12 kB
          Peter Bacsko
        9. YARN-9011-009.patch
          11 kB
          Peter Bacsko
        10. YARN-9011-branch-3.1.001.patch
          11 kB
          Peter Bacsko
        11. YARN-9011-branch-3.2.001.patch
          11 kB
          Peter Bacsko

            People

              Assignee: Peter Bacsko (pbacsko)
              Reporter: Peter Bacsko (pbacsko)