Rohith Sharma K S Thanks for reviewing! You are right in the sense this patch is mostly letting DrainDispatcher not reuse AsyncDispatcher's drained field, but the fix for
YARN-2991 is still there.
does small tiny race is causing TEZ test failures?
Yes. In Tez UT tests, invocation of dispatcher.await() finished without handling all events and assertion after dispatcher.await() failed. This race condition only happens when queue is almost empty, which is exactly the case in Tez UT tests.
If so would it be good to fix in AsyncDispatcher rather adding full duplicate code.
The root cause of race is we cannot guarantee we enqueue event and update drained atomically. I didn't find a way to fix this without adding more synchronization which is a very expensive fix for a minimum benefit.
YARN-3878 discussed about this race and decided to ignore it for the same reason.
How about adding additional check before adding into event queue to avoid a race?
While this may avoid enqueuing last event, race can still happen without invoking dispatcher.serviceStop(). Actually in Tez UT test, we never invoke dispatcher.serviceStop().