The problem could be solved by increasing the number of retries to a sufficiently large value, say (the maximum number of mapper slots) / 256. But the performance is still poor, since a client could wait up to (3 * number_of_retries) seconds.
A common case we see is that a read request can be served in well under 3 seconds (sometimes subsecond), so it is a waste of time to wait the full 3 seconds and then let another batch of 256 clients read the same block. So we propose the following change in the DFSClient to introduce a random factor into the wait time. Instead of a fixed value of 3000 ms, the wait time becomes the following formula:
waitTime = 3000 * failures + 3000 * (failures + 1) * rand(0, 1);
where failures is the number of failures so far (starting from 0), and rand(0, 1) returns a random double between 0.0 and 1.0.
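The formula above can be sketched in Java as follows. This is a minimal illustration, not the actual DFSClient patch; the class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of the proposed randomized backoff (names are illustrative).
public class BlockReadBackoff {
    // waitTime = 3000 * failures + 3000 * (failures + 1) * rand(0, 1)
    // For failures = 0 the wait falls in [0, 3s); for failures = 1 in [3s, 9s);
    // for failures = 2 in [6s, 15s), and so on.
    static long waitTime(int failures, Random rand) {
        return (long) (3000.0 * failures + 3000.0 * (failures + 1) * rand.nextDouble());
    }

    public static void main(String[] args) {
        Random rand = new Random();
        for (int failures = 0; failures < 3; failures++) {
            System.out.println("failures=" + failures
                    + " waitTime=" + waitTime(failures, rand) + " ms");
        }
    }
}
```

Note that the lower bound of each window (3000 * failures) is exactly the upper bound of the previous round's random component, which is what lets each retry round drain before the next one piles on.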
The rationale behind this formula is as follows:
1) On the first BlockMissingException, the client waits a random time between 0 and 3 seconds and then retries. If the block read can be served quickly, the client gets it sooner than it would by always waiting the full 3 seconds. Also, by spreading the clients evenly over the 3-second window, more clients can be served in this round of retries.
2) If the client gets the same exception and retries a second time, either the read is too slow, or the number of requests is so large that the client was unlucky and did not get a spot in the previous retry window. To address the first problem, the second retry waits at least 3 seconds, ensuring that all clients from the first retry have at least started (and hopefully some have already finished). To address the second problem, we widen the waiting window to 6 seconds, so that there is less contention for the third retry.
3) Similarly, the third retry waits at least 6 seconds to let the second retry's window drain, and widens its own waiting window to 9 seconds.
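Putting the pieces together, the retry loop in the client would look roughly like the sketch below. The interface and method names are stand-ins for the real DFSClient read path, not its actual API.

```java
import java.util.Random;

// Illustrative retry loop with the proposed randomized backoff.
// BlockReader and readWithBackoff are hypothetical stand-ins, not DFSClient code.
public class RetryLoopSketch {
    interface BlockReader {
        byte[] read() throws Exception; // stands in for the real block read call
    }

    static byte[] readWithBackoff(BlockReader reader, int maxRetries)
            throws Exception {
        Random rand = new Random();
        int failures = 0;
        while (true) {
            try {
                return reader.read();
            } catch (Exception e) { // stands in for BlockMissingException
                if (failures >= maxRetries) {
                    throw e; // give up after the configured number of retries
                }
                // Proposed formula: base delay grows linearly, random window widens
                long waitTime = (long) (3000.0 * failures
                        + 3000.0 * (failures + 1) * rand.nextDouble());
                Thread.sleep(waitTime);
                failures++;
            }
        }
    }
}
```

A successful first read incurs no wait at all, which matches the goal of not penalizing the common fast-read case.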
Any comments on the design, or proposals for a unit test?