Thanks, Wei-Chiu Chuang, for reviewing the changes.
IIUC, the client would get stuck in chooseDataNode() in such a scenario?
Yeah, the reader thread keeps retrying until it reaches the maximum number of retries, and then gets a BlockMissingException. But since this is a hedged read, the read has already completed against the actual host. So the read completes successfully, but the call returns to the user only after all retries are exhausted. In the non-hedged case, the read would fail. It was fixed in
The method chooseDataNode should add a @Nullable to indicate a null return value is valid.
I tried to add @Nullable, but my IDE started showing a javadoc error, so I added a full javadoc that mentions the possible null return value. Hope that works for you.
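For illustration, documenting a null return in javadoc could look roughly like the sketch below. The class, method name, and String-based node type are hypothetical stand-ins, not the actual chooseDataNode code from the patch.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the client code; all names are illustrative only.
class NullableChooseSketch {

    /**
     * Picks the first candidate node that is not in the ignored set.
     *
     * @param candidates nodes that hold a replica of the block
     * @param ignored    nodes that must be skipped
     * @return the chosen node, or {@code null} if every candidate is ignored;
     *         callers must check for null before using the result
     */
    static String chooseNode(List<String> candidates, Collection<String> ignored) {
        for (String node : candidates) {
            if (!ignored.contains(node)) {
                return node;
            }
        }
        // No eligible node left: null is a valid, documented outcome.
        return null;
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("dn1", "dn2");
        // One node ignored: the other one is chosen.
        System.out.println(chooseNode(nodes, Collections.singleton("dn1")));
        // Every node ignored: the method returns null instead of throwing.
        System.out.println(chooseNode(nodes, nodes));
    }
}
```

The javadoc carries the same information an annotation would, at the cost of not being checked by static analysis tools.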
can be simplified as chosenNode = chooseDataNode(block, ignored, false);
That's a good catch. Changed.
The timeout of 30 seconds seems a little short. On my laptop this test takes approximately 20 seconds, so on a busy host the unit test might potentially run slightly over time. Or would it be reasonable to reduce some wait time?
E.g. reduce dfs.client.retry.window.base from 3000 to 1000?
Yeah, I increased the timeout to 60000 ms and reduced dfs.client.retry.window.base to 1000 ms as well. Thank you for the hint.
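To see why a smaller window base helps the test finish well inside the timeout, here is a rough worst-case estimate. It assumes the client waits `window * failures + window * (failures + 1) * r` before each retry, with `r` a random value in [0, 1), and that it gives up after 3 failures. Both the formula and the retry count are assumptions for this sketch, not values taken from the patch.

```java
// Back-of-the-envelope estimate; the backoff formula and retry count are assumptions.
class RetryWindowSketch {

    /**
     * Upper bound on the wait before retry number {@code failures} (0-based),
     * i.e. the assumed formula evaluated with the random factor at its limit of 1.
     */
    static long maxWaitMs(long windowBaseMs, int failures) {
        return windowBaseMs * failures + windowBaseMs * (failures + 1);
    }

    /** Sum of the per-retry upper bounds over {@code maxFailures} retries. */
    static long maxTotalWaitMs(long windowBaseMs, int maxFailures) {
        long total = 0;
        for (int f = 0; f < maxFailures; f++) {
            total += maxWaitMs(windowBaseMs, f);
        }
        return total;
    }

    public static void main(String[] args) {
        int assumedMaxFailures = 3; // assumption, not taken from the patch
        System.out.println("window 3000 ms: up to "
            + maxTotalWaitMs(3000, assumedMaxFailures) + " ms of waiting");
        System.out.println("window 1000 ms: up to "
            + maxTotalWaitMs(1000, assumedMaxFailures) + " ms of waiting");
    }
}
```

Under these assumptions the retry waits scale linearly with the window base, so cutting it to a third cuts the worst-case waiting to a third as well.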
Please check the updated patch.