Yes, that should get us most of the way toward solving this problem.
In our internal branch (based on branch-1), we were re-throwing the exception after trying all the nodes.
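The re-throw behaviour could be sketched roughly as below. This is a hypothetical illustration of the pattern, not actual HDFS code; `Node` and `read` are placeholder names.

```java
import java.io.IOException;
import java.util.List;

/**
 * Hypothetical sketch of the "re-throw after trying all nodes" behaviour
 * described above. Node/read are placeholder names, not actual HDFS APIs.
 */
public class RetryAllNodes {

    interface Node {
        byte[] read() throws IOException;
    }

    /**
     * Try each node in order; if every node fails, re-throw the last
     * exception instead of swallowing it, so the caller sees the failure
     * rather than silently treating it as success.
     */
    static byte[] readFromAnyNode(List<Node> nodes) throws IOException {
        IOException last = null;
        for (Node n : nodes) {
            try {
                return n.read();                 // first success wins
            } catch (IOException e) {
                last = e;                        // remember, try next node
            }
        }
        throw (last != null) ? last : new IOException("no nodes to try");
    }

    public static void main(String[] args) throws IOException {
        Node bad = () -> { throw new IOException("node down"); };
        Node good = () -> new byte[] {1, 2, 3};

        // One bad node, one good node: the read succeeds via the good one.
        if (readFromAnyNode(List.of(bad, good)).length != 3)
            throw new AssertionError("expected data from the good node");

        // All nodes bad: the last exception propagates to the caller.
        boolean threw = false;
        try {
            readFromAnyNode(List.of(bad, bad));
        } catch (IOException e) {
            threw = true;
        }
        if (!threw)
            throw new AssertionError("expected re-thrown IOException");
        System.out.println("ok");
    }
}
```

The point of the last `throw` is that exhausting all nodes is surfaced to the caller instead of being silently dropped, which is what turns the failure into visible retry/recovery rather than data loss.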
To be clear, this issue is about data loss.
Yes, Stack, I have seen this in my clusters. We solved it by adding the code proposed above. (At the time I concentrated on fixing HDFS-3222 only on branch-2, but I should have proposed the changes for branch-1 as well; see the affected versions marked in HDFS-3222.) One small gap I have seen in branch-1 is that bytes acked are not tracked as accurately as in hadoop-2 today. So, if we read the length from another node that has a smaller length than the primary node, and the primary node connects back just before the actual read request starts, this kind of problem can still occur. I have seen in another JIRA that marking the failed node in the dead-node list when we get RPC errors while fetching the length should help solve that issue.
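The dead-node idea from that other JIRA could look roughly like the sketch below. This is an illustrative assumption, not actual DFSInputStream code; `Replica`, `fetchVisibleLength`, and the node names are made-up stand-ins for the real RPC and tracking structures.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Illustrative sketch (not actual DFSInputStream code) of marking a node
 * dead when the length-fetch RPC fails, so later attempts skip it.
 */
public class DeadNodeTracking {

    interface Replica {
        String name();
        long fetchVisibleLength() throws IOException;  // stand-in for the RPC
    }

    private final Set<String> deadNodes = new HashSet<>();

    /**
     * Ask each replica that is not already marked dead for its visible
     * length; on an RPC error, add that replica to the dead-node list
     * instead of silently falling back to a possibly shorter length.
     */
    long bestVisibleLength(List<Replica> replicas) throws IOException {
        long best = -1;
        for (Replica r : replicas) {
            if (deadNodes.contains(r.name())) continue;  // skip known-bad nodes
            try {
                best = Math.max(best, r.fetchVisibleLength());
            } catch (IOException e) {
                deadNodes.add(r.name());                 // mark failed node dead
            }
        }
        if (best < 0) throw new IOException("all replicas failed");
        return best;
    }

    boolean isDead(String name) {
        return deadNodes.contains(name);
    }

    public static void main(String[] args) throws IOException {
        DeadNodeTracking t = new DeadNodeTracking();
        Replica failing = new Replica() {
            public String name() { return "dn1"; }
            public long fetchVisibleLength() throws IOException {
                throw new IOException("rpc error");
            }
        };
        Replica healthy = new Replica() {
            public String name() { return "dn2"; }
            public long fetchVisibleLength() { return 4096L; }
        };
        long len = t.bestVisibleLength(List.of(failing, healthy));
        if (len != 4096L) throw new AssertionError("expected healthy length");
        if (!t.isDead("dn1")) throw new AssertionError("dn1 should be dead");
        if (t.isDead("dn2")) throw new AssertionError("dn2 should be live");
        System.out.println("ok");
    }
}
```

Marking the node dead at length-fetch time means a flaky primary cannot come back mid-read and hand out a shorter length than the one we already committed to.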
We have not seen this issue since that fix went into our internal branch.
So, I am +1 for doing that.
@Nicolas, do you have a patch ready for branch-1? If not, I will generate one for branch-1 sometime next week.