Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Fix Version/s: 2.7.0
- Labels: None
- Environment: NM restart is enabled
- Hadoop Flags: Reviewed
Description
On NM restart, the NM sends all the outstanding NMContainerStatus to the RM during registration. The RM can treat the registering NM as a new node or as a reconnecting node, and triggers the corresponding event based on whether the node is added or reconnected.
- Node added event: two scenarios can occur here
  - New node is registering with a different ip:port – NOT A PROBLEM
  - Old node is re-registering because of a RESYNC command from RM when RM restarts – NOT A PROBLEM
- Node reconnected event:
  - Existing node is re-registering, i.e. RM treats it as a reconnecting node when RM is not restarted
    - NM RESTART NOT Enabled – NOT A PROBLEM
    - NM RESTART is Enabled
      - Some applications are running on this node – Problem is here
      - Zero applications are running on this node – NOT A PROBLEM
Since the NMContainerStatuses are not handled, the RM never learns about the completed containers and never releases the resources held by those containers. The RM will not allocate new containers for pending resource requests until the completedContainer event is triggered. As a result, applications wait indefinitely because their pending container requests are never served by the RM.
Attachments
- 0001-yarn-3194-v1.patch (13 kB) – Rohith Sharma K S
- 0001-YARN-3194.patch (18 kB) – Rohith Sharma K S
Activity
The current flow of notifying applications about completed containers is via the RM:
- AM sends a stopContainerRequest to the NM.
- NM kills the container and sends the containerStatus to the RM in a heartbeat ONLY ONCE. The NM keeps this container until the RM piggybacks a response saying the container can be removed.
- RM releases the resources that were allocated to the container and tells the NM to remove the container from the NM's tracking.
- In the AM's heartbeat to the RM, the RM sends the completed container details back to the AM.
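For illustration only, here is a minimal, self-contained sketch of the "report once, keep until acknowledged" behaviour described in the steps above. The class and method names are hypothetical, not YARN APIs:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not YARN code: models an NM that reports a completed
// container to the RM, but keeps it in a local pending map until the RM's
// heartbeat response says the container can be removed.
class CompletedContainerTracker {
  // completed containers not yet acknowledged by the RM, keyed by container id
  private final Map<String, String> pendingCompleted = new LinkedHashMap<>();

  // called when a container finishes on this node
  void containerFinished(String containerId, String diagnostics) {
    pendingCompleted.put(containerId, diagnostics);
  }

  // statuses to piggyback on the next NM -> RM heartbeat
  List<String> statusesForHeartbeat() {
    return new ArrayList<>(pendingCompleted.keySet());
  }

  // the RM's heartbeat response tells the NM which containers it may forget
  void onHeartbeatResponse(List<String> containersToBeRemoved) {
    containersToBeRemoved.forEach(pendingCompleted::remove);
  }
}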
In case of NM restart, the NM sends the outstanding NMContainerStatus list to the RM during NM registration. This list has both RUNNING and COMPLETED containers, and these container statuses are never sent again by the NM. But on NM registration, RM (RMNodeImpl.AddNodeTransition#transition) is processing only RUNNING containers; COMPLETED containers are ignored. The impact is that the resources allocated to these containers are never released by the RM. At this point, the RM state is completely inconsistent with the actual cluster state.
The application hangs because, in the above inconsistent state, the RM never allocates the new containers asked for by the AM, so the AM keeps waiting for containers from the RM.
Before the YARN-2997 fix this worked fine, because the completed containers sent in the NM registration were sent again in the NM's first heartbeat, so RMNodeImpl handled the completed containers in the nodeHeartbeat event. The bug was hidden before YARN-2997 because the NM was sending duplicate container statuses, but the RM really should have handled it.
Does it make sense?
RM(RMNodeImpl.AddNodeTransition#transition) is processing only RUNNING containers. COMPLETED containers are ignored.
Completed containers are also processed. Please refer to {{RMContainerImpl#ContainerRecoveredTransition}}.
Both running and completed containers sent by NM on re-registration will be processed by the new RM and routed back to the AM.
I'm downgrading the priority for now. Please raise the priority if you still think this is a real issue.
Thanks jianhe for pointing out the container recovery flow!! Issue priority can be decided later, not a problem.
I had a deeper look at the NM registration flow. Two scenarios can occur:
- Node added event: two scenarios can occur here
  - New node is registering with a different ip:port – NOT A PROBLEM
  - Old node is re-registering because of a RESYNC command from RM when RM restarts – NOT A PROBLEM
- Node reconnected event:
  - Existing node is re-registering, i.e. RM treats it as a reconnecting node when RM is not restarted
    - NM RESTART NOT Enabled – NOT A PROBLEM
    - NM RESTART is Enabled – Problem is here
When a node is reconnected and applications are running on that node, the NMContainerStatuses are ignored. I think the RMNodeReconnectEvent should carry the NMContainerStatuses and process them.
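For illustration, a minimal sketch of how the reconnect event could carry the registration-time container reports so that ReconnectNodeTransition can process them. The field and constructor shapes here are assumptions, not the committed patch:

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.api.protocolrecords.NMContainerStatus;

// Sketch only: the idea is that ResourceTrackerService passes the
// NMContainerStatus list received at registration along with the reconnect
// event, so ReconnectNodeTransition can process it.
public class RMNodeReconnectEvent extends RMNodeEvent {
  private final RMNode reconnectedNode;
  private final List<ApplicationId> runningApplications;
  private final List<NMContainerStatus> containerStatuses;

  public RMNodeReconnectEvent(NodeId nodeId, RMNode newNode,
      List<ApplicationId> runningApps,
      List<NMContainerStatus> containerReports) {
    super(nodeId, RMNodeEventType.RECONNECTED);
    this.reconnectedNode = newNode;
    this.runningApplications = runningApps;
    this.containerStatuses = containerReports;
  }

  public RMNode getReconnectedNode() {
    return reconnectedNode;
  }

  public List<ApplicationId> getRunningApplications() {
    return runningApplications;
  }

  public List<NMContainerStatus> getNMContainerStatuses() {
    return containerStatuses;
  }
}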
And, not related specifically to this jira, I have one doubt on this method: ResourceTrackerService#handleNMContainerStatus is sending the container_finished event only for the master container. Why are other containers not considered? I think it is done intentionally as an optimization, since the container_finished event would release the other containers' resources. Is this the reason?
I have one doubt on this method ResourceTrackerService#handleNMContainerStatus
This is legacy code for non-work-preserving restart. We could remove that. Just disregard this method.
NM RESTART is Enabled – Problem is here
For node_reconnect event, it's removing the old node and adding the newly connected node. RM is also not restarted. I don't think we need to handle the RMNodeReconnectEvent
it's removing the old node and adding the newly connected node. RM is also not restarted.
RMNodeImpl.ReconnectNodeTransition#transition does not remove the old node if any applications are running. In the code below, if noRunningApps is false the node is not removed; instead only the running applications are handled.
public void transition(RMNodeImpl rmNode, RMNodeEvent event) {
  RMNodeReconnectEvent reconnectEvent = (RMNodeReconnectEvent) event;
  RMNode newNode = reconnectEvent.getReconnectedNode();
  rmNode.nodeManagerVersion = newNode.getNodeManagerVersion();
  List<ApplicationId> runningApps = reconnectEvent.getRunningApplications();
  boolean noRunningApps = (runningApps == null) || (runningApps.size() == 0);

  // No application running on the node, so send node-removal event with
  // cleaning up old container info.
  if (noRunningApps) {
    // Remove the node from scheduler
    // Add node to the scheduler
  } else {
    rmNode.httpPort = newNode.getHttpPort();
    rmNode.httpAddress = newNode.getHttpAddress();
    rmNode.totalCapability = newNode.getTotalCapability();

    // Reset heartbeat ID since node just restarted.
    rmNode.getLastNodeHeartBeatResponse().setResponseId(0);
  }

  // Handles running app on this node
  // resource update to scheduler code
}
Thanks rohithsharma for reporting and jianhe for your inputs.
I am also able to reproduce this issue. I see that the RM does not get information about the completed containers after NM restart as part of the node heartbeat request, and the AM also does not tell the RM to release these completed containers. As a result, the RM assumes these containers are still running and does not release their resources.
Attached the version-1 patch.
The patch does the following:
- Added ReconnectedEvent to process NMContainerStatus if applications are running on the node
Kindly review the patch
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12699148/0001-yarn-3194-v1.patch
against trunk revision 814afa4.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.recovery.TestFSRMStateStore
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6647//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6647//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6647//console
This message is automatically generated.
I think the NM after restart will try to relaunch these running containers as RecoveredContainers, and if it cannot locate the pid (assume the container completed during NM downtime), it will report that and trigger the completion of those containers. Do I miss anything here?
jlowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?
Jason Lowe, I remember we discussed this case in some JIRA under YARN-1336, did you see this problem before?
I didn't see this problem originally, but I suspect it was because there were two things that masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing it with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).
I agree that we should be processing the container report sent with the NM registration, and it appears that is being dropped in the reconnected event.
Comments on the patch:
I noticed that the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition. One difference is that we don't remove containers that have completed from the launchedContainers map which seems wrong. I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM. Therefore I think we should refactor the code to reuse this logic, as it should apply here just as it does for StatusUpdateWhenHealthyTransition.
rohithsharma, thanks for your explanation. Could you edit the description to make the problem clearer?
- Is it possible to have a common method for the below code in ReconnectNodeTransition and StatusUpdateWhenHealthyTransition? (One possible shape of such a helper is sketched after these review comments.)
// Filter the map to only obtain just launched containers and finished
// containers.
List<ContainerStatus> newlyLaunchedContainers =
    new ArrayList<ContainerStatus>();
List<ContainerStatus> completedContainers =
    new ArrayList<ContainerStatus>();
for (NMContainerStatus remoteContainer : reconnectEvent
    .getNMContainerStatuses()) {
  ContainerId containerId = remoteContainer.getContainerId();
  // Process running containers
  if (remoteContainer.getContainerState() == ContainerState.RUNNING) {
    if (!rmNode.launchedContainers.contains(containerId)) {
      // Just launched container. RM knows about it the first time.
      rmNode.launchedContainers.add(containerId);
      ContainerStatus cStatus = createContainerStatus(remoteContainer);
      newlyLaunchedContainers.add(cStatus);
    }
  } else {
    ContainerStatus cStatus = createContainerStatus(remoteContainer);
    completedContainers.add(cStatus);
  }
}
if (newlyLaunchedContainers.size() != 0 || completedContainers.size() != 0) {
  rmNode.nodeUpdateQueue.add(new UpdatedContainerInfo(
      newlyLaunchedContainers, completedContainers));
}
- Is below condition valid for the newly added code in ReconnectNodeTransition too ?
// Don't bother with containers already scheduled for cleanup, or for
// applications already killed. The scheduler doens't need to know any
// more about this container
if (rmNode.containersToClean.contains(containerId)) {
  LOG.info("Container " + containerId + " already scheduled for "
      + "cleanup, no further processing");
  continue;
}
if (rmNode.finishedApplications.contains(containerId
    .getApplicationAttemptId().getApplicationId())) {
  LOG.info("Container " + containerId
      + " belongs to an application that is already killed,"
      + " no further processing");
  continue;
}
- Add a timeout to the test, rename testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart, and add more detailed comments about what the test is doing?
@Test public void testAppCleanupWhenNMRstarts() throws Exception
- Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?
Assert.assertEquals(3072, allocatedMB);
- Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?
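Regarding the refactoring point above, here is one hypothetical shape of a shared helper inside RMNodeImpl (the name and exact signature are assumptions, not the committed patch) that both StatusUpdateWhenHealthyTransition and ReconnectNodeTransition could delegate to once their reports are converted to ContainerStatus. It relies on the launchedContainers, containersToClean, finishedApplications and nodeUpdateQueue fields shown in the snippets above:

// Hypothetical helper in RMNodeImpl; a sketch, not the actual patch.
private void handleContainerStatuses(List<ContainerStatus> containerStatuses) {
  List<ContainerStatus> newlyLaunchedContainers =
      new ArrayList<ContainerStatus>();
  List<ContainerStatus> completedContainers =
      new ArrayList<ContainerStatus>();
  for (ContainerStatus remoteContainer : containerStatuses) {
    ContainerId containerId = remoteContainer.getContainerId();
    // Skip containers already scheduled for cleanup or belonging to
    // applications that are already finished.
    if (this.containersToClean.contains(containerId)
        || this.finishedApplications.contains(
            containerId.getApplicationAttemptId().getApplicationId())) {
      continue;
    }
    if (remoteContainer.getState() == ContainerState.RUNNING) {
      if (!this.launchedContainers.contains(containerId)) {
        // Just launched container: RM hears about it for the first time.
        this.launchedContainers.add(containerId);
        newlyLaunchedContainers.add(remoteContainer);
      }
    } else {
      // Completed container: stop tracking it and report it to the scheduler.
      this.launchedContainers.remove(containerId);
      completedContainers.add(remoteContainer);
    }
  }
  if (!newlyLaunchedContainers.isEmpty() || !completedContainers.isEmpty()) {
    this.nodeUpdateQueue.add(new UpdatedContainerInfo(
        newlyLaunchedContainers, completedContainers));
  }
}

Note that this sketch also drops completed containers from launchedContainers, which addresses the earlier review comment about that map.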
I didn't see this problem originally, but I suspect it was because there were two things that masked it. As mentioned above, this problem doesn't manifest before YARN-2997. In addition, I was testing it with MapReduce applications, and the MR AM will explicitly kill containers for tasks that have completed (as reported by the umbilical connection between the AM and tasks).
I see. I think that's why we didn't notice this issue before. However, this bug does happen after YARN-2997, so we should mark the affected version as 2.7.
I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM.
I think we can do some code refactoring here. However, I think two things could differ between a reconnect and a regular status update: 1. the port number could change (an ephemeral port is used when NM work-preserving restart is disabled); 2. the resource could be updated (assuming the NM's resource could have been updated beforehand). Isn't it?
This should be a blocker for 2.7, as it blocks the rolling upgrade feature, which works in 2.6.
Thanks jlowe, junping_du, jianhe for the detailed review.
the container status processing code is almost a duplicate of the same code in StatusUpdateWhenHealthyTransition
Agree, this has to be refactored. The majority of the container status processing code is the same.
we don't remove containers that have completed from the launchedContainers map which seems wrong
I see, yes. Completed containers should be removed from launchedContainers.
I don't see why we would process container status sent during a reconnect differently than a regular status update from the NM
IIUC it is only to deal with NMContainerStatus versus ContainerStatus, but I am not sure why the two are created as separate types. What I see is that ContainerStatus is a subset of NMContainerStatus; I think ContainerStatus could have been part of NMContainerStatus.
Is below condition valid for the newly added code in ReconnectNodeTransition too ?
Yes, it is applicable since we are keeping the old RMNode object.
Add a timeout to the test, rename testAppCleanupWhenNMRstarts -> testProcessingContainerStatusesOnNMRestart, and add more detailed comments about what the test is doing?
Agree.
Could you add a validation that ApplicationMasterService#allocate indeed receives the completed container in this scenario?
Agree, I will add it.
Question: does the 3072 include 1024 for the AM container and 2048 for the allocated container ?
AM memory is 1024 and the additionally requested container memory is 2048. In the test, the number of requested containers is 1, so allocatedMB should be AM + requested, i.e. 1024 + 2048 = 3072.
Attached the patch addressing all the above comments. Kindly review the new patch.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12699499/0001-YARN-3194.patch
against trunk revision 2ecea5a.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6660//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6660//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6660//console
This message is automatically generated.
Findbugs warnings are unrelated to this Jira. These warnings will be handled as part of YARN-3204.
+1 lgtm. Will commit this tomorrow if there are no further comments.
Thanks to Rohith for the contribution and to Jian and Junping for additional review! I committed this to trunk and branch-2.
FAILURE: Integrated in Hadoop-trunk-Commit #7162 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7162/)
YARN-3194. RM should handle NMContainerStatuses sent by NM while registering if NM is Reconnected node. Contributed by Rohith (jlowe: rev a64dd3d24bfcb9af21eb63869924f6482b147fd3)
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestApplicationCleanup.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeReconnectEvent.java
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMNodeTransitions.java
- hadoop-yarn-project/CHANGES.txt
- hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceTrackerService.java
rohithsharma, the completed containers are also notified to the applications. Mind clarifying more about how the application hangs?