This is essentially the same race condition as in YARN-5901: resourcemanager.getServiceState() == STATE.STARTED does not guarantee that the resource manager is fully started.
Uploading the same fix as for YARN-5901. If this unreliable check is used often in the code base, we could extract it as a util method.
Do we get any test failures without this change? IIUC the client will retry if the service address is unreachable.
You are right, however, that the service state is set before the RM has actually started.
Yes, we have seen consistent failures on some of our machines. My guess is that the thread that starts the resource manager is always delayed on that platform, so much so that the client cannot reach the server even with 10 retries.
Okay. The fix makes sense. Let me have a closer look.
Haibo Chen, previously we were throwing an IOException if the RM did not change to the STARTED state.
So shouldn't we check the result of CountDownLatch#await and, if it is false, throw an IOException, since that indicates the RM has not yet started?
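To illustrate the point about the return value, here is a minimal, self-contained sketch rather than the actual patch. The class name StartupWaitSketch, the method waitForStartup, and the timeout values are hypothetical. Only the behaviour of CountDownLatch#await(long, TimeUnit), which returns false when the timeout elapses before the count reaches zero, is taken from the JDK.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not the code from the YARN-5903 patch.
public class StartupWaitSketch {

  /**
   * Blocks until the latch reaches zero or the timeout elapses.
   * A false result from await means the timeout expired first, so we
   * surface that as an IOException instead of silently continuing.
   */
  public static void waitForStartup(CountDownLatch started, long timeoutMillis)
      throws IOException, InterruptedException {
    if (!started.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
      throw new IOException(
          "ResourceManager failed to start within " + timeoutMillis + " ms");
    }
  }

  public static void main(String[] args) throws Exception {
    CountDownLatch started = new CountDownLatch(1);

    // Stand-in for the thread that starts the RM: it counts down only
    // after startup has really finished, not when a state flag flips.
    Thread starter = new Thread(() -> {
      try {
        Thread.sleep(100); // simulated slow startup
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      started.countDown();
    });
    starter.start();

    waitForStartup(started, 5000);
    System.out.println("RM started");

    try {
      // Nothing ever counts this latch down, so the wait times out.
      waitForStartup(new CountDownLatch(1), 50);
    } catch (IOException e) {
      System.out.println("Caught: " + e.getMessage());
    }
  }
}
```

Ignoring the boolean returned by await would let the caller proceed after a timeout as if the RM were up, which is the case the IOException is meant to cover.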
My bad. I misinterpreted the CountDownLatch API. Will update the patch to incorporate your comments.
Probably we can make changes consistent with changes in YARN-5901.
Uploaded another patch to make it consistent with YARN-5901, as Varun Saxena suggested.
+1 pending Jenkins.
Will commit it later today unless there are further comments.
FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11030 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11030/)
YARN-5903. Fix race condition in (varunsaxena: rev 38e66d4d64f3c2e2bb43d8e5dca3866d672322b6)
Committed to trunk and branch-2.
Thanks Haibo Chen for your contribution.