My thinking is that it is bad for the client to give up too early: if it does not retry, the application will encounter an IO error.
>an ongoing pipeline recovery on the same block.
It is possible that the first attempt from the client encountered an ongoing pipeline recovery on the primary datanode. But that does not mean that if the client retries recoverBlock on the newly selected primary (originally the second datanode in the pipeline), it too will encounter an ongoing pipeline recovery! It is possible that the original primary is network-partitioned from the remaining datanodes in the pipeline and the original pipeline recovery never succeeded. Isn't this situation possible?
I am wondering why the need to not retry? Not retrying means that the client IO will fail. This is very bad, isn't it? I am assuming that as long as there is some possibility of recovery, the system should try all those opportunities rather than make the client IO fail, especially when the tradeoff is negligible extra RPC overhead, and that too only in error cases.
However, I like the idea of the client checking whether it is an AlreadyCommitted exception and not retrying in that case.
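To make the policy I am arguing for concrete, here is a minimal sketch: retry block recovery against successive candidate primaries, but give up immediately on an already-committed error, since retrying cannot help there. The names (`Primary`, `recoverBlock`, the two exception types) are hypothetical stand-ins, not the actual HDFS API.

```java
import java.util.List;

public class RecoverBlockRetry {
    // Hypothetical: the block was already finalized; retrying is pointless.
    static class AlreadyCommittedException extends Exception {}
    // Hypothetical: another recovery is in progress (it may be stuck,
    // e.g. the old primary is network-partitioned from the pipeline).
    static class RecoveryInProgressException extends Exception {}

    // Hypothetical stand-in for a datanode acting as recovery primary.
    interface Primary {
        void recoverBlock() throws AlreadyCommittedException,
                                   RecoveryInProgressException;
    }

    /** Try each candidate primary in turn; stop only on success or on
     *  AlreadyCommittedException, which no amount of retrying can fix. */
    static boolean recoverWithRetry(List<Primary> candidates) {
        for (Primary p : candidates) {
            try {
                p.recoverBlock();
                return true;                 // recovery succeeded
            } catch (AlreadyCommittedException e) {
                return false;                // do not retry this case
            } catch (RecoveryInProgressException e) {
                // the ongoing recovery on this primary may never finish;
                // fall through and retry with the next datanode as primary
            }
        }
        return false;                        // all candidates exhausted
    }
}
```

The point of the sketch is that the only non-retryable outcome is the already-committed one; an in-progress recovery on one primary is not evidence that the next primary will see the same thing.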