Hadoop YARN / YARN-3194

RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 2.7.0, 2.6.2, 3.0.0-alpha1
    • Component/s: resourcemanager
    • Labels: None
    • Environment: NM restart is enabled
    • Hadoop Flags: Reviewed

    Description

      On NM restart, the NM sends all outstanding NMContainerStatuses to the RM during registration. The RM can treat the registration as either a new node or a reconnecting node, and triggers the corresponding node-added or node-reconnected event.

      1. Node added event: again here 2 scenarios can occur
        1. New node is registering with a different ip:port – NOT A PROBLEM
        2. Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
      2. Node reconnected event:
        1. Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
          1. NM RESTART NOT enabled – NOT A PROBLEM
          2. NM RESTART enabled
            1. Some applications are running on this node – PROBLEM IS HERE
            2. Zero applications are running on this node – NOT A PROBLEM

      Since the NMContainerStatuses are not handled, the RM never learns about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. As a result, applications wait indefinitely because their pending container requests are never served by the RM.

      Attachments

        1. 0001-yarn-3194-v1.patch
          13 kB
          Rohith Sharma K S
        2. 0001-YARN-3194.patch
          18 kB
          Rohith Sharma K S

        Activity

          jianhe Jian He added a comment -

          rohithsharma, the completed containers are also notified to applications. Mind clarifying more how the application is hung?

          rohithsharma Rohith Sharma K S added a comment -

          The current flow of notifying applications of completed containers is via the RM:

          1. The AM sends a stopContainerRequest to the NM.
          2. The NM kills the container and sends the containerStatus to the RM in a heartbeat ONLY ONCE. The NM keeps this container until the RM piggybacks an ack telling the NM the container can be removed.
          3. The RM releases the resources allocated to the container, and sends the ack back to the NM so the container is removed from the NM's tracking.
          4. In the AM's heartbeat to the RM, the RM sends the completed container details to the AM.
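
          (For reference, a minimal AM-side sketch of steps 1 and 4 above using the public YARN client API. This is an illustration, not code from this JIRA: it assumes an already-started NMClient and AMRMClient, omits error handling, and does not show the RM/NM-internal steps 2 and 3.)

            import java.util.List;

            import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
            import org.apache.hadoop.yarn.api.records.ContainerId;
            import org.apache.hadoop.yarn.api.records.ContainerStatus;
            import org.apache.hadoop.yarn.api.records.NodeId;
            import org.apache.hadoop.yarn.client.api.AMRMClient;
            import org.apache.hadoop.yarn.client.api.NMClient;

            public class CompletedContainerFlowSketch {

              // Step 1: the AM asks the NM to stop a container.
              static void stopContainer(NMClient nmClient, ContainerId containerId,
                  NodeId nodeId) throws Exception {
                nmClient.stopContainer(containerId, nodeId);
              }

              // Step 4: on a later AM heartbeat, the RM reports the completed
              // container back to the AM. This only happens after the RM has learned
              // of the completion from the NM, which is exactly what breaks in this bug.
              static void pollCompletedContainers(
                  AMRMClient<AMRMClient.ContainerRequest> amRmClient) throws Exception {
                AllocateResponse response = amRmClient.allocate(0.5f);
                List<ContainerStatus> completed = response.getCompletedContainersStatuses();
                for (ContainerStatus status : completed) {
                  System.out.println("Completed: " + status.getContainerId()
                      + ", exit status=" + status.getExitStatus());
                }
              }
            }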

          In case of NM restart, the NM sends the list of outstanding NMContainerStatuses to the RM during NM registration. This list has both RUNNING and COMPLETED containers. These container statuses are never sent to the RM again. But on NM registration, the RM (RMNodeImpl.AddNodeTransition#transition) processes only RUNNING containers; COMPLETED containers are ignored. The impact is that the resources allocated to these containers are never released by the RM. At this point the RM state is completely inconsistent with the actual cluster state.

          The application hangs because, in the above inconsistent state, the RM never allocates the new containers asked for by the AM, so the AM keeps waiting for containers from the RM.

          Before the YARN-2997 fix this worked fine, because the completed containers sent in the NM registration were sent again in the NM's first heartbeat, so RMNodeImpl handled the completed containers in the nodeHeartbeat event. The bug was hidden before YARN-2997 because the NM was sending duplicate container statuses, but the RM should have handled it regardless.
          Does it make sense?

          jianhe Jian He added a comment -

          RM(RMNodeImpl.AddNodeTransition#transition) is processing only RUNNING containers. COMPLETED containers are ignored.

          Completed containers are also processed. Please refer to RMContainerImpl#ContainerRecoveredTransition.
          Both running and completed containers sent by NM on re-registration will be processed by the new RM and routed back to the AM.

          jianhe Jian He added a comment -

          I'm downgrading the priority for now. Please raise the priority if you still think this is a real issue.

          rohithsharma Rohith Sharma K S added a comment -

          Thanks jianhe for pointing me to the container recovery flow!! The issue priority can be decided later, not a problem.

          I had a deeper look at the NM registration flow. There are 2 scenarios that can occur:

          1. Node added event: again here 2 scenarios can occur
            1. New node is registering with a different ip:port – NOT A PROBLEM
            2. Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
          2. Node reconnected event:
            1. Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
              1. NM RESTART NOT enabled – NOT A PROBLEM
              2. NM RESTART enabled – the problem is here
                When the node is reconnected and applications are running on that node, the NMContainerStatuses are ignored. I think RMNodeReconnectEvent should carry the NMContainerStatuses and process them; a rough sketch of what that could look like follows.
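
          (A minimal sketch, not the committed patch, of how RMNodeReconnectEvent could carry the registration-time container statuses. getReconnectedNode(), getRunningApplications() and getNMContainerStatuses() appear in code quoted elsewhere in this JIRA; the constructor shape, the RMNodeEventType.RECONNECTED event type and the field names are assumptions for illustration.)

            public class RMNodeReconnectEvent extends RMNodeEvent {
              private final RMNode reconnectedNode;
              private final List<ApplicationId> runningApplications;
              private final List<NMContainerStatus> containerStatuses;

              public RMNodeReconnectEvent(NodeId nodeId, RMNode newNode,
                  List<ApplicationId> runningApps,
                  List<NMContainerStatus> containerReports) {
                super(nodeId, RMNodeEventType.RECONNECTED);
                this.reconnectedNode = newNode;
                this.runningApplications = runningApps;
                this.containerStatuses = containerReports;
              }

              public RMNode getReconnectedNode() {
                return reconnectedNode;
              }

              public List<ApplicationId> getRunningApplications() {
                return runningApplications;
              }

              // New accessor so ReconnectNodeTransition can process the
              // NMContainerStatuses sent with the NM registration.
              public List<NMContainerStatus> getNMContainerStatuses() {
                return containerStatuses;
              }
            }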
          rohithsharma Rohith Sharma K S added a comment -

          Not related specifically to this JIRA, but I have one doubt about the method ResourceTrackerService#handleNMContainerStatus: it sends a container_finished event only for the master container. Why are the other containers not considered? I think it was done intentionally as an optimization, so that the master container's container_finished event would release the other containers' resources. Is that the reason?

          jianhe Jian He added a comment -

          I have one doubt on this method ResourceTrackerService#handleNMContainerStatus

          This is legacy code for non-work-preserving restart. We could remove that. Just disregard this method.

          NM RESTART is Enabled – Problem is here

          For the node_reconnect event, it's removing the old node and adding the newly connected node. The RM is also not restarted. I don't think we need to handle the RMNodeReconnectEvent.

          rohithsharma Rohith Sharma K S added a comment -

          it's removing the old node and adding the newly connected node. RM is also not restarted.

          RMNodeImpl.ReconnectNodeTransition#transition does not remove the old node if any applications are running. In the code below, if noRunningApps is false the node is not removed; instead only the running applications are handled.

          public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
                RMNodeReconnectEvent reconnectEvent = (RMNodeReconnectEvent) event;
                RMNode newNode = reconnectEvent.getReconnectedNode();
                rmNode.nodeManagerVersion = newNode.getNodeManagerVersion();
                List<ApplicationId> runningApps = reconnectEvent.getRunningApplications();
                boolean noRunningApps = 
                    (runningApps == null) || (runningApps.size() == 0);
                
                // No application running on the node, so send node-removal event with 
                // cleaning up old container info.
                if (noRunningApps) {
                  // Remove the node from scheduler
                  // Add node to the scheduler
                } else {
                  rmNode.httpPort = newNode.getHttpPort();
                  rmNode.httpAddress = newNode.getHttpAddress();
                  rmNode.totalCapability = newNode.getTotalCapability();
                
                  // Reset heartbeat ID since node just restarted.
                  rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
                }
          
                // Handles running app on this node
                // resource update to schedule code
              }
          
          devaraj Devaraj Kavali added a comment -

          Thanks rohithsharma for reporting and jianhe for your inputs.

          I am also able to reproduce this issue. I see that the RM does not get information about the completed containers after NM restart as part of the node heartbeat request, and the AM also does not tell the RM to release these completed containers. As a result the RM assumes these containers are still running and never releases their resources.

          rohithsharma Rohith Sharma K S added a comment -

          Attached the version-1 patch.
          The patch does the following:

          1. Added handling of the ReconnectedEvent to process NMContainerStatuses if applications are running on the node.

          Kindly review the patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12699148/0001-yarn-3194-v1.patch
          against trunk revision 814afa4.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6647//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6647//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6647//console

          This message is automatically generated.

          junping_du Junping Du added a comment -

          I think the NM, after restarting, will try to relaunch these running containers as RecoveredContainers, and if it cannot locate the pid (assuming the container completed during the NM downtime), it would report and trigger the completion of those containers. Am I missing anything here?
          jlowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?

          rohithsharma Rohith Sharma K S added a comment - edited

          junping_du, I see in the NM logs the same behaviour you explained, after NM restart.

          jlowe Jason Darrell Lowe added a comment -

          Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?

          I didn't see this problem originally, but I suspect it was because there were two things that masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing it with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).

          I agree that we should be processing the container report sent with the NM registration, and it appears that is being dropped in the reconnected event.

          Comments on the patch:

          I noticed that the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition. One difference is that we don't remove containers that have completed from the launchedContainers map which seems wrong. I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM. Therefore I think we should refactor the code to reuse this logic, as it should apply here just as it does for StatusUpdateWhenHealthyTransition.

          jianhe Jian He added a comment -

          rohithsharma, thanks for your explanation. Could you edit the description to be clearer about the problem?

          • Is it possible to have a common method for below code in ReconnectNodeTransition and StatusUpdateWhenHealthyTransition ?
                    // Filter the map to only obtain just launched containers and finished
                    // containers.
                    List<ContainerStatus> newlyLaunchedContainers =
                        new ArrayList<ContainerStatus>();
                    List<ContainerStatus> completedContainers =
                        new ArrayList<ContainerStatus>();
                    for (NMContainerStatus remoteContainer : reconnectEvent
                        .getNMContainerStatuses()) {
                      ContainerId containerId = remoteContainer.getContainerId();
            
                      // Process running containers
                      if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
                        if (!rmNode.launchedContainers.contains(containerId)) {
                          // Just launched container. RM knows about it the first time.
                          rmNode.launchedContainers.add(containerId);
                          ContainerStatus cStatus = createContainerStatus(remoteContainer);
                          newlyLaunchedContainers.add(cStatus);
                        }
                      } else {
            
                        ContainerStatus cStatus = createContainerStatus(remoteContainer);
                        completedContainers.add(cStatus);
                      }
                    }
                    if (newlyLaunchedContainers.size() != 0
                        || completedContainers.size() != 0) {
                      rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
                          newlyLaunchedContainers, completedContainers));
                    }
            
          • Is below condition valid for the newly added code in ReconnectNodeTransition too ?
                    // Don't bother with containers already scheduled for cleanup, or for
                    // applications already killed. The scheduler doens't need to know any
                    // more about this container
                    if (rmNode.containersToClean.contains(containerId)) {
                      LOG.info("Container " + containerId + " already scheduled for " +
                      		"cleanup, no further processing");
                      continue;
                    }
                    if (rmNode.finishedApplications.contains(containerId
                        .getApplicationAttemptId().getApplicationId())) {
                      LOG.info("Container " + containerId
                          + " belongs to an application that is already killed,"
                          + " no further processing");
                      continue;
                    }
            
          • Add timeout to the test, testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart ? and add more detailed comments about what the test is doing too ?
            @Test
              public void testAppCleanupWhenNMRstarts() throws Exception
            
          • Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?
             Assert.assertEquals(3072, allocatedMB);
            
          • Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?
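
          (A rough sketch, not the committed patch, of the shared helper suggested in the first review point above. It simply pulls together the logic from the two snippets quoted in this comment, reusing the field and method names visible there (launchedContainers, nodeUpdateQueue, containersToClean, finishedApplications, createContainerStatus, UpdatedContainerInfo), and additionally removes completed containers from launchedContainers as Jason suggested. The helper name handleContainerStatuses is illustrative, and the code assumes it lives inside RMNodeImpl where those fields and createContainerStatus are accessible.)

            // Shared processing of container statuses reported by the NM, usable from
            // both ReconnectNodeTransition and StatusUpdateWhenHealthyTransition.
            private static void handleContainerStatuses(RMNodeImpl rmNode,
                List<NMContainerStatus> containerStatuses) {
              List<ContainerStatus> newlyLaunchedContainers =
                  new ArrayList<ContainerStatus>();
              List<ContainerStatus> completedContainers =
                  new ArrayList<ContainerStatus>();

              for (NMContainerStatus remoteContainer : containerStatuses) {
                ContainerId containerId = remoteContainer.getContainerId();

                // Skip containers already scheduled for cleanup or belonging to
                // applications that are already finished (same guard as the regular
                // heartbeat path).
                if (rmNode.containersToClean.contains(containerId)
                    || rmNode.finishedApplications.contains(
                        containerId.getApplicationAttemptId().getApplicationId())) {
                  continue;
                }

                if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
                  if (!rmNode.launchedContainers.contains(containerId)) {
                    // Just launched container. RM knows about it the first time.
                    rmNode.launchedContainers.add(containerId);
                    newlyLaunchedContainers.add(createContainerStatus(remoteContainer));
                  }
                } else {
                  // Completed container: report it to the scheduler and, per the
                  // review comment above, also drop it from launchedContainers.
                  rmNode.launchedContainers.remove(containerId);
                  completedContainers.add(createContainerStatus(remoteContainer));
                }
              }

              if (newlyLaunchedContainers.size() != 0
                  || completedContainers.size() != 0) {
                rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
                    newlyLaunchedContainers, completedContainers));
              }
            }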
          jianhe Jian He added a comment -

          Didn't see Jason's comments, agree with his comments too.

          junping_du Junping Du added a comment -

          I didn't see this problem originally, but I suspect it was because there were two things that masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing it with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).

          I see. I think that's why we didn't notice this issue before. However, this bug should happen after YARN-2997, so we should mark the affected version as 2.7.

          I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM.

          I think we can do some code refactoring here. However, two things could be different between a reconnect and a regular status update: 1. the port number could have changed (an ephemeral port is used when NM work-preserving restart is disabled); 2. the resource could have been updated (assuming the NM's resource could have been updated earlier). Isn't it?

          junping_du Junping Du added a comment -

          Updated the affects version to 2.7. Maybe a blocker?

          junping_du Junping Du added a comment -

          Should be a blocker for 2.7, as it blocks the rolling upgrade feature which works in 2.6.

          rohithsharma Rohith Sharma K S added a comment - edited

          Thanks jlowe, junping_du, jianhe for the detailed review.

          the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition

          Agree, this has to be refactored. The majority of the containerStatus processing code is the same.

          we don't remove containers that have completed from the launchedContainers map which seems wrong

          I see, yes. Completed containers should be removed from launchedContainers.

          I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM

          IIUC it is only to deal with NMContainerStatus vs. ContainerStatus, but I am not sure why these were created as two different types. What I see is that ContainerStatus is a subset of NMContainerStatus; I think ContainerStatus could have been embedded inside NMContainerStatus.

          Is below condition valid for the newly added code in ReconnectNodeTransition too ?

          Yes, it is applicable, since we are keeping the old RMNode object.

          Add timeout to the test, testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart ? and add more detailed comments about what the test is doing too ?

          Agree.

          Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?

          Agree, I will add it.

          Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?

          AM memory is 1024 and the additionally requested container memory is 2048. In the test, the number of requested containers is 1, so allocatedMB should be AM + requested, i.e. 1024 + 2048 = 3072.

          rohithsharma Rohith Sharma K S added a comment -

          Attached the patch addressing all the above comments. Kindly review the new patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12699499/0001-YARN-3194.patch
          against trunk revision 2ecea5a.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6660//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6660//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6660//console

          This message is automatically generated.

          rohithsharma Rohith Sharma K S added a comment -

          Findbugs warnings are unrelated to this JIRA. These warnings will be handled as part of YARN-3204.

          jlowe Jason Darrell Lowe added a comment -

          +1 lgtm. Will commit this tomorrow if there are no further comments.

          jianhe Jian He added a comment -

          lgtm too

          junping_du Junping Du added a comment -

          lgtm three.

          jlowe Jason Darrell Lowe added a comment -

          Thanks to Rohith for the contribution and to Jian and Junping for additional review! I committed this to trunk and branch-2.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7162 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7162/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #111 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/111/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #845 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/845/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2043 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2043/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #102 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/102/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          jlowe Jason Darrell Lowe added a comment -

          I committed this to branch-2.6 as well.


          People

             Assignee: rohithsharma Rohith Sharma K S
             Reporter: rohithsharma Rohith Sharma K S
             Votes: 0
             Watchers: 8
