Hadoop YARN / YARN-3194

RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.7.0
    • Fix Version/s: 2.7.0, 2.6.2, 3.0.0-alpha1
    • Component/s: resourcemanager
    • Labels: None
    • Environment: NM restart is enabled
    • Hadoop Flags: Reviewed

    Description

      On NM restart, the NM sends all outstanding NMContainerStatuses to the RM during registration. The RM can treat the registration as either a new node or a reconnecting node, and triggers the corresponding node-added or node-reconnected event.

      1. Node added event: again here 2 scenarios can occur
        1. New node is registering with a different ip:port – NOT A PROBLEM
        2. Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
      2. Node reconnected event:
        1. Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
          1. NM RESTART NOT enabled – NOT A PROBLEM
          2. NM RESTART enabled
            1. Some applications are running on this node – PROBLEM IS HERE
            2. Zero applications are running on this node – NOT A PROBLEM

      Since the NMContainerStatuses are not handled, the RM never learns about completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. As a result, applications wait indefinitely because their pending container requests are never served by the RM.

      Attachments

        1. 0001-yarn-3194-v1.patch
          13 kB
          Rohith Sharma K S
        2. 0001-YARN-3194.patch
          18 kB
          Rohith Sharma K S

        Activity

          jianhe Jian He added a comment -

          rohithsharma, the completed containers are also notified to applications. Mind clarifying more how the application is hung?

          rohithsharma Rohith Sharma K S added a comment -

          The current flow of notifying applications of completed containers is via the RM:

          1. The AM sends a stopContainerRequest to the NM.
          2. The NM kills the container and sends the containerStatus to the RM in a heartbeat ONLY ONCE. The NM keeps this container until the RM piggybacks an ack telling the NM the container can be removed.
          3. The RM releases the resources allocated to the container, and sends the ack back to the NM so the container is removed from the NM's tracking.
          4. In the AM's heartbeat to the RM, the RM sends the completed container details to the AM.
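
          (For reference, a minimal AM-side sketch of steps 1 and 4 above using the public YARN client API. This is an illustration, not code from this JIRA: it assumes an already-started NMClient and AMRMClient, omits error handling, and does not show the RM/NM-internal steps 2 and 3.)

            import java.util.List;

            import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
            import org.apache.hadoop.yarn.api.records.ContainerId;
            import org.apache.hadoop.yarn.api.records.ContainerStatus;
            import org.apache.hadoop.yarn.api.records.NodeId;
            import org.apache.hadoop.yarn.client.api.AMRMClient;
            import org.apache.hadoop.yarn.client.api.NMClient;

            public class CompletedContainerFlowSketch {

              // Step 1: the AM asks the NM to stop a container.
              static void stopContainer(NMClient nmClient, ContainerId containerId,
                  NodeId nodeId) throws Exception {
                nmClient.stopContainer(containerId, nodeId);
              }

              // Step 4: on a later AM heartbeat, the RM reports the completed
              // container back to the AM. This only happens after the RM has learned
              // of the completion from the NM, which is exactly what breaks in this bug.
              static void pollCompletedContainers(
                  AMRMClient<AMRMClient.ContainerRequest> amRmClient) throws Exception {
                AllocateResponse response = amRmClient.allocate(0.5f);
                List<ContainerStatus> completed = response.getCompletedContainersStatuses();
                for (ContainerStatus status : completed) {
                  System.out.println("Completed: " + status.getContainerId()
                      + ", exit status=" + status.getExitStatus());
                }
              }
            }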

          In case of NM restart, the NM sends the list of outstanding NMContainerStatuses to the RM during NM registration. This list has both RUNNING and COMPLETED containers. These container statuses are never sent to the RM again. But on NM registration, the RM (RMNodeImpl.AddNodeTransition#transition) processes only RUNNING containers; COMPLETED containers are ignored. The impact is that the resources allocated to these containers are never released by the RM. At this point the RM state is completely inconsistent with the actual cluster state.

          The application hangs because, in the above inconsistent state, the RM never allocates the new containers asked for by the AM, so the AM keeps waiting for containers from the RM.

          Before the YARN-2997 fix this worked fine, because the completed containers sent in the NM registration were sent again in the NM's first heartbeat, so RMNodeImpl handled the completed containers in the nodeHeartbeat event. The bug was hidden before YARN-2997 because the NM was sending duplicate container statuses, but the RM should have handled it regardless.
          Does it make sense?

          jianhe Jian He added a comment -

          RM(RMNodeImpl.AddNodeTransition#transition) is processing only RUNNING containers. COMPLETED containers are ignored.

          Completed containers are also processed. Please refer to RMContainerImpl#ContainerRecoveredTransition.
          Both running and completed containers sent by NM on re-registration will be processed by the new RM and routed back to the AM.

          jianhe Jian He added a comment -

          I'm downgrading the priority for now. Please raise the priority if you still think this is a real issue.

          rohithsharma Rohith Sharma K S added a comment -

          Thanks jianhe for pointing me to the container recovery flow!! The issue priority can be decided later, not a problem.

          I had a deeper look at the NM registration flow. There are 2 scenarios that can occur:

          1. Node added event: again here 2 scenarios can occur
            1. New node is registering with a different ip:port – NOT A PROBLEM
            2. Old node is re-registering because of a RESYNC command from the RM after RM restart – NOT A PROBLEM
          2. Node reconnected event:
            1. Existing node is re-registering, i.e. the RM treats it as a reconnecting node when the RM has not restarted
              1. NM RESTART NOT enabled – NOT A PROBLEM
              2. NM RESTART enabled – the problem is here
                When the node is reconnected and applications are running on that node, the NMContainerStatuses are ignored. I think RMNodeReconnectEvent should carry the NMContainerStatuses and process them; a rough sketch of what that could look like follows.
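
          (A minimal sketch, not the committed patch, of how RMNodeReconnectEvent could carry the registration-time container statuses. getReconnectedNode(), getRunningApplications() and getNMContainerStatuses() appear in code quoted elsewhere in this JIRA; the constructor shape, the RMNodeEventType.RECONNECTED event type and the field names are assumptions for illustration.)

            public class RMNodeReconnectEvent extends RMNodeEvent {
              private final RMNode reconnectedNode;
              private final List<ApplicationId> runningApplications;
              private final List<NMContainerStatus> containerStatuses;

              public RMNodeReconnectEvent(NodeId nodeId, RMNode newNode,
                  List<ApplicationId> runningApps,
                  List<NMContainerStatus> containerReports) {
                super(nodeId, RMNodeEventType.RECONNECTED);
                this.reconnectedNode = newNode;
                this.runningApplications = runningApps;
                this.containerStatuses = containerReports;
              }

              public RMNode getReconnectedNode() {
                return reconnectedNode;
              }

              public List<ApplicationId> getRunningApplications() {
                return runningApplications;
              }

              // New accessor so ReconnectNodeTransition can process the
              // NMContainerStatuses sent with the NM registration.
              public List<NMContainerStatus> getNMContainerStatuses() {
                return containerStatuses;
              }
            }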
          rohithsharma Rohith Sharma K S added a comment -

          Not related specifically to this JIRA, but I have one doubt about the method ResourceTrackerService#handleNMContainerStatus: it sends a container_finished event only for the master container. Why are the other containers not considered? I think it was done intentionally as an optimization, so that the master container's container_finished event would release the other containers' resources. Is that the reason?

          jianhe Jian He added a comment -

          I have one doubt on this method ResourceTrackerService#handleNMContainerStatus

          This is legacy code for non-work-preserving restart. We could remove that. Just disregard this method.

          NM RESTART is Enabled – Problem is here

          For the node_reconnect event, it's removing the old node and adding the newly connected node. The RM is also not restarted. I don't think we need to handle the RMNodeReconnectEvent.

          rohithsharma Rohith Sharma K S added a comment -

          it's removing the old node and adding the newly connected node. RM is also not restarted.

          RMNodeImpl.ReconnectNodeTransition#transition does not remove the old node if any applications are running. In the code below, if noRunningApps is false the node is not removed; instead only the running applications are handled.

          public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
                RMNodeReconnectEvent reconnectEvent = (RMNodeReconnectEvent) event;
                RMNode newNode = reconnectEvent.getReconnectedNode();
                rmNode.nodeManagerVersion = newNode.getNodeManagerVersion();
                List<ApplicationId> runningApps = reconnectEvent.getRunningApplications();
                boolean noRunningApps = 
                    (runningApps == null) || (runningApps.size() == 0);
                
                // No application running on the node, so send node-removal event with 
                // cleaning up old container info.
                if (noRunningApps) {
                  // Remove the node from scheduler
                  // Add node to the scheduler
                } else {
                  rmNode.httpPort = newNode.getHttpPort();
                  rmNode.httpAddress = newNode.getHttpAddress();
                  rmNode.totalCapability = newNode.getTotalCapability();
                
                  // Reset heartbeat ID since node just restarted.
                  rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
                }
          
                // Handles running app on this node
                // resource update to schedule code
              }
          
          devaraj Devaraj Kavali added a comment -

          Thanks rohithsharma for reporting and jianhe for your inputs.

          I am also able to reproduce this issue. I see that the RM does not get information about the completed containers after NM restart as part of the node heartbeat request, and the AM also does not tell the RM to release these completed containers. As a result the RM assumes these containers are still running and never releases their resources.

          rohithsharma Rohith Sharma K S added a comment -

          Attached the version-1 patch.
          The patch does the following:

          1. Added handling of the ReconnectedEvent to process NMContainerStatuses if applications are running on the node.

          Kindly review the patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12699148/0001-yarn-3194-v1.patch
          against trunk revision 814afa4.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

          org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6647//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6647//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6647//console

          This message is automatically generated.

          junping_du Junping Du added a comment -

          I think the NM, after restarting, will try to relaunch these running containers as RecoveredContainers, and if it cannot locate the pid (assuming the container completed during the NM downtime), it would report and trigger the completion of those containers. Am I missing anything here?
          jlowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?

          rohithsharma Rohith Sharma K S added a comment - edited

          junping_du, I see in the NM logs the same behaviour you explained, after NM restart.

          jlowe Jason Darrell Lowe added a comment -

          Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?

          I didn't see this problem originally, but I suspect it was because there were two things that masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing it with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).

          I agree that we should be processing the container report sent with the NM registration, and it appears that is being dropped in the reconnected event.

          Comments on the patch:

          I noticed that the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition. One difference is that we don't remove containers that have completed from the launchedContainers map which seems wrong. I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM. Therefore I think we should refactor the code to reuse this logic, as it should apply here just as it does for StatusUpdateWhenHealthyTransition.

          jianhe Jian He added a comment -

          rohithsharma, thanks for your explanation. Could you edit the description to be clearer about the problem?

          • Is it possible to have a common method for below code in ReconnectNodeTransition and StatusUpdateWhenHealthyTransition ?
                    // Filter the map to only obtain just launched containers and finished
                    // containers.
                    List<ContainerStatus> newlyLaunchedContainers =
                        new ArrayList<ContainerStatus>();
                    List<ContainerStatus> completedContainers =
                        new ArrayList<ContainerStatus>();
                    for (NMContainerStatus remoteContainer : reconnectEvent
                        .getNMContainerStatuses()) {
                      ContainerId containerId = remoteContainer.getContainerId();
            
                      // Process running containers
                      if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
                        if (!rmNode.launchedContainers.contains(containerId)) {
                          // Just launched container. RM knows about it the first time.
                          rmNode.launchedContainers.add(containerId);
                          ContainerStatus cStatus = createContainerStatus(remoteContainer);
                          newlyLaunchedContainers.add(cStatus);
                        }
                      } else {
            
                        ContainerStatus cStatus = createContainerStatus(remoteContainer);
                        completedContainers.add(cStatus);
                      }
                    }
                    if (newlyLaunchedContainers.size() != 0
                        || completedContainers.size() != 0) {
                      rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
                          newlyLaunchedContainers, completedContainers));
                    }
            
          • Is below condition valid for the newly added code in ReconnectNodeTransition too ?
                    // Don't bother with containers already scheduled for cleanup, or for
                    // applications already killed. The scheduler doens't need to know any
                    // more about this container
                    if (rmNode.containersToClean.contains(containerId)) {
                      LOG.info("Container " + containerId + " already scheduled for " +
                      		"cleanup, no further processing");
                      continue;
                    }
                    if (rmNode.finishedApplications.contains(containerId
                        .getApplicationAttemptId().getApplicationId())) {
                      LOG.info("Container " + containerId
                          + " belongs to an application that is already killed,"
                          + " no further processing");
                      continue;
                    }
            
          • Add timeout to the test, testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart ? and add more detailed comments about what the test is doing too ?
            @Test
              public void testAppCleanupWhenNMRstarts() throws Exception
            
          • Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?
             Assert.assertEquals(3072, allocatedMB);
            
          • Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?
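
          (A rough sketch, not the committed patch, of the shared helper suggested in the first review point above. It simply pulls together the logic from the two snippets quoted in this comment, reusing the field and method names visible there (launchedContainers, nodeUpdateQueue, containersToClean, finishedApplications, createContainerStatus, UpdatedContainerInfo), and additionally removes completed containers from launchedContainers as Jason suggested. The helper name handleContainerStatuses is illustrative, and the code assumes it lives inside RMNodeImpl where those fields and createContainerStatus are accessible.)

            // Shared processing of container statuses reported by the NM, usable from
            // both ReconnectNodeTransition and StatusUpdateWhenHealthyTransition.
            private static void handleContainerStatuses(RMNodeImpl rmNode,
                List<NMContainerStatus> containerStatuses) {
              List<ContainerStatus> newlyLaunchedContainers =
                  new ArrayList<ContainerStatus>();
              List<ContainerStatus> completedContainers =
                  new ArrayList<ContainerStatus>();

              for (NMContainerStatus remoteContainer : containerStatuses) {
                ContainerId containerId = remoteContainer.getContainerId();

                // Skip containers already scheduled for cleanup or belonging to
                // applications that are already finished (same guard as the regular
                // heartbeat path).
                if (rmNode.containersToClean.contains(containerId)
                    || rmNode.finishedApplications.contains(
                        containerId.getApplicationAttemptId().getApplicationId())) {
                  continue;
                }

                if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
                  if (!rmNode.launchedContainers.contains(containerId)) {
                    // Just launched container. RM knows about it the first time.
                    rmNode.launchedContainers.add(containerId);
                    newlyLaunchedContainers.add(createContainerStatus(remoteContainer));
                  }
                } else {
                  // Completed container: report it to the scheduler and, per the
                  // review comment above, also drop it from launchedContainers.
                  rmNode.launchedContainers.remove(containerId);
                  completedContainers.add(createContainerStatus(remoteContainer));
                }
              }

              if (newlyLaunchedContainers.size() != 0
                  || completedContainers.size() != 0) {
                rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
                    newlyLaunchedContainers, completedContainers));
              }
            }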
          jianhe Jian He added a comment -

          Didn't see Jason's comments, agree with his comments too.

          junping_du Junping Du added a comment -

          I didn't see this problem originally, but I suspect it was because there were two things that masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing it with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).

          I see. I think that's why we didn't notice this issue before. However, this bug should happen after YARN-2997, so we should mark the affected version as 2.7.

          I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM.

          I think we can do some code refactoring here. However, two things could be different between a reconnect and a regular status update: 1. the port number could have changed (an ephemeral port is used when NM work-preserving restart is disabled); 2. the resource could have been updated (assuming the NM's resource could have been updated earlier). Isn't it?

          junping_du Junping Du added a comment -

          Updated the affects version to 2.7. Maybe a blocker?

          junping_du Junping Du added a comment -

          Should be a blocker for 2.7, as it blocks the rolling upgrade feature which works in 2.6.

          rohithsharma Rohith Sharma K S added a comment - edited

          Thanks jlowe, junping_du, jianhe for the detailed review.

          the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition

          Agree, this has to be refactored. The majority of the containerStatus processing code is the same.

          we don't remove containers that have completed from the launchedContainers map which seems wrong

          I see, yes. Completed containers should be removed from launchedContainers.

          I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM

          IIUC it is only to deal with NMContainerStatus vs. ContainerStatus, but I am not sure why these were created as two different types. What I see is that ContainerStatus is a subset of NMContainerStatus; I think ContainerStatus could have been embedded inside NMContainerStatus.

          Is below condition valid for the newly added code in ReconnectNodeTransition too ?

          Yes, it is applicable, since we are keeping the old RMNode object.

          Add timeout to the test, testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart ? and add more detailed comments about what the test is doing too ?

          Agree.

          Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?

          Agree, I will add it.

          Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?

          AM memory is 1024 and the additionally requested container memory is 2048. In the test, the number of requested containers is 1, so allocatedMB should be AM + requested, i.e. 1024 + 2048 = 3072.

          rohithsharma Rohith Sharma K S added a comment -

          Attached the patch addressing all the above comments. Kindly review the new patch.

          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12699499/0001-YARN-3194.patch
          against trunk revision 2ecea5a.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 2 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          -1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

          Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6660//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6660//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
          Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6660//console

          This message is automatically generated.

          rohithsharma Rohith Sharma K S added a comment -

          Findbugs warnings are unrelated to this JIRA. These warnings will be handled as part of YARN-3204.

          jlowe Jason Darrell Lowe added a comment -

          +1 lgtm. Will commit this tomorrow if there are no further comments.

          jianhe Jian He added a comment -

          lgtm too

          junping_du Junping Du added a comment -

          lgtm three.

          jlowe Jason Darrell Lowe added a comment -

          Thanks to Rohith for the contribution and to Jian and Junping for additional review! I committed this to trunk and branch-2.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #7162 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7162/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #111 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/111/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #845 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/845/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk #2043 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2043/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #102 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/102/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #112 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/112/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          • hadoop-yarn-project/CHANGES.txt
          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #2062 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2062/)
          YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)

          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
          • hadoop-yarn-project/CHANGES.txt
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
          • hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
          jlowe Jason Darrell Lowe added a comment -

          I committed this to branch-2.6 as well.


          People

             Assignee: rohithsharma Rohith Sharma K S
             Reporter: rohithsharma Rohith Sharma K S
             Votes: 0
             Watchers: 8
