Thanks, Jon, for the review!
Re-title this jira since this is not a test problem according to the patch, but a race condition in the MRAppMaster that is exposed most frequently via this test.
I have renamed this JIRA accordingly. One clarification: I do not see this problem as a race. The bug is exposed when this test runs on JDK7, where the order in which test methods execute is not fixed.
Add your analysis to the jira so that the actual problem is documented and captured for future use.
This failure is intermittent. It occurs only when the test methods in TestStagingCleanup run in a particular order, for example testDeletionofStagingOnReboot() followed by testDeletionofStagingOnKillLastTry().
The failure comes from notifyIsLastAMRetry(). When this function is called, it calls setForcejobCompletion(). If appMaster.stop() is called after setForcejobCompletion(), it tries to stop an appMaster that was already forced to stop, and it hits an NPE. If appMaster.stop() is called first, we do not get the NPE when it later tries forceJobCompletion, because there is already a null check before it proceeds.
hook.run() is also called in testDeletionofStagingOnKill(), but we do not get the NPE in that case, because that test allows 4 app attempts: MRAppMaster appMaster = new TestMRApp(attemptId, mockAlloc, 4);
whereas testDeletionofStagingOnKillLastTry() allows only 1 attempt, to make sure there is no retry: MRAppMaster appMaster = new TestMRApp(attemptId, mockAlloc, 1); //no retry
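To make the ordering dependence concrete, here is a minimal, self-contained sketch. It is not the MRAppMaster code: SketchAppMaster, its committer field, and stopThrowsNpe() are hypothetical names, and the rule that the last retry is the attempt that reaches the attempt limit is an assumption made for illustration. It only models the behavior described above: a guarded forced completion, an unguarded stop(), and a last-retry check that depends on the attempt limit.

```java
public class StopOrderSketch {

    /** Hypothetical stand-in for an application master; NOT the real MRAppMaster. */
    static class SketchAppMaster {
        // State that is released on shutdown (hypothetical).
        private Runnable committer = () -> { };
        private final boolean lastRetry;

        SketchAppMaster(int attemptNumber, int maxAttempts) {
            // Assumption: the last retry is the attempt that reaches the limit.
            this.lastRetry = attemptNumber >= maxAttempts;
        }

        /** Models the shutdown-hook path: completion is forced only on the last retry. */
        void notifyIsLastAMRetry() {
            if (lastRetry) {
                forceJobCompletion();
            }
        }

        /** Guarded by a null check, so it is safe to call after stop(). */
        void forceJobCompletion() {
            if (committer == null) {
                return;
            }
            committer.run();
            committer = null;
        }

        /** Not guarded, so it throws an NPE if completion was already forced. */
        void stop() {
            committer.run();
            committer = null;
        }
    }

    /** Runs the two calls in the requested order and reports whether an NPE occurred. */
    static boolean stopThrowsNpe(int attemptNumber, int maxAttempts, boolean hookFirst) {
        SketchAppMaster am = new SketchAppMaster(attemptNumber, maxAttempts);
        try {
            if (hookFirst) {
                am.notifyIsLastAMRetry();
                am.stop();
            } else {
                am.stop();
                am.notifyIsLastAMRetry();
            }
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(stopThrowsNpe(1, 1, true));  // hook first, single attempt
        System.out.println(stopThrowsNpe(1, 1, false)); // stop first, single attempt
        System.out.println(stopThrowsNpe(1, 4, true));  // hook first, 4 attempts
    }
}
```

In this sketch only the first case (single attempt, hook before stop()) reports an NPE. Stopping first is protected by the null check, and with 4 attempts the hook never forces completion.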
Please determine if the java7 label is still accurate based on your analysis
We still need the java7 label, because without this fix TestStagingCleanup does not fail every time. It fails only when the tests run in a particular order.