My test was doing something wrong. After I fixed that, the 001 patch stopped helping (which makes more sense, because that code never actually did DECOMMISSIONING --> UNHEALTHY).
I put back the code removed by YARN-4676 that you mentioned, but tweaked it a little and moved it above the getIsNodeHealthy call, so that a node can now transition to DECOMMISSIONED even if it is UNHEALTHY.
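To illustrate the ordering only, here is a rough sketch, not the actual patch. The class, enum, and method names are made up for this example, and it assumes the condition for finishing decommission is simply "no applications left running on the node", which may not match the real check exactly.

```java
// Hypothetical, simplified model of the status-update ordering; not RMNodeImpl code.
final class StatusUpdateSketch {
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED, UNHEALTHY }

    // Decide the next state for a node status update.
    // Assumption: decommission completes once runningApps == 0.
    static NodeState nextState(NodeState current, boolean healthy, int runningApps) {
        // Decommission check comes first, so it wins even when the node is unhealthy.
        if (current == NodeState.DECOMMISSIONING && runningApps == 0) {
            return NodeState.DECOMMISSIONED;
        }
        // Health check comes second (stand-in for the getIsNodeHealthy call).
        if (!healthy) {
            return NodeState.UNHEALTHY;
        }
        return current;
    }
}
```

With the checks in this order, a DECOMMISSIONING node that reports unhealthy and has nothing left running ends up DECOMMISSIONED rather than UNHEALTHY.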
I temporarily added a bunch more log statements to help investigate, and saw that sometimes handleContainerStatus (when called from StatusUpdateWhenHealthyTransition) would add an Application to runningApplications, but then nothing ever removed it. This happened way more frequently for DECOMMISSIONING nodes, but I did see it once happen to a normal node. There's a piece of code here that adds an Application to runningApplications if it sees a Container without an Application in runningApplications. I changed this code to call handleRunningAppOnNode instead of simply adding the Application, which basically makes it check that the Application still exists. I'm not exactly sure why this is happening, but from what I can tell, this issue is based on some timing of when things occur, and somehow DECOMMISSIONING makes it more likely to happen.
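Roughly, the change to the container handling looks like the sketch below. This is a simplified stand-in, not the real code: application IDs are plain strings, knownApps is a made-up substitute for looking the Application up in the RM, and recording a missing Application in finishedApplications is my assumption about what the existence check does when the lookup fails.

```java
// Hypothetical, simplified model of the runningApplications bookkeeping; not RMNodeImpl code.
final class RunningAppsSketch {
    // Stand-in for the RM's view of which Applications still exist (assumption).
    private final java.util.Set<String> knownApps;
    private final java.util.Set<String> runningApplications = new java.util.HashSet<>();
    private final java.util.Set<String> finishedApplications = new java.util.HashSet<>();

    RunningAppsSketch(java.util.Set<String> knownApps) {
        this.knownApps = new java.util.HashSet<>(knownApps);
    }

    // Called for each reported container. Before the change, an unknown appId
    // was added to runningApplications unconditionally here.
    void onContainerStatus(String appId) {
        if (!runningApplications.contains(appId)) {
            handleRunningAppOnNode(appId);
        }
    }

    // Only track the Application as running if it still exists.
    private void handleRunningAppOnNode(String appId) {
        if (!knownApps.contains(appId)) {
            // Application is already gone: never mark it running, so nothing is left stuck.
            finishedApplications.add(appId);
            return;
        }
        runningApplications.add(appId);
    }

    java.util.Set<String> running() {
        return java.util.Collections.unmodifiableSet(runningApplications);
    }

    java.util.Set<String> finished() {
        return java.util.Collections.unmodifiableSet(finishedApplications);
    }
}
```

The point is that a late container report for an Application that has already finished no longer leaves an entry in runningApplications that nothing will ever remove.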
I've attached a 002 patch with the new changes. I ran my test over 150 times with the 002 patch and it worked every time. When I ran my test without the patch (or with the 001 patch, or with just adding the code removed by YARN-4676), it would fail on the first run, except for one time where it failed on the second.
Junping Du, please take a look.