Affects Version/s: 2.0.0
Fix Version/s: None
In HDFS branch-2, this bug is fixed, but there are two other issues.
1) For simple cases, such as a single dead node, we don't benefit from
HDFS-3703, and the default location ordering leads us to try to connect to a dead datanode when we should not. This has not been analysed yet; a specific JIRA will be created later.
2) If we are redirected to the wrong node, we experience a huge delay:
The pseudo code in DFSInputStream#readBlockLength is:
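(The original snippet is not preserved in this copy; roughly, the loop looks like the following — a paraphrase of the logic, not the exact HDFS code:)

```
for each datanode in locatedblock.getLocations():
    try:
        // open a ClientDatanodeProtocol connection to this replica's datanode
        cdp = createClientDatanodeProtocolProxy(datanode, conf, timeout)
        // ask the datanode for the visible length of the replica; success ends the loop
        return cdp.getReplicaVisibleLength(locatedblock.getBlock())
    catch IOException:
        // this datanode failed: fall through and try the next location
        continue
```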
However, with this code, the connection is created with a null RetryPolicy, so it falls back to the default of 10 connect retries.
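(The quoted snippet is not preserved in this copy; the default comes from the Hadoop IPC client reading the retry count from configuration — a paraphrase, not an exact quote:)

```
// Hadoop IPC client (paraphrased): with no explicit RetryPolicy,
// the connect retry count falls back to this configuration default.
this.maxRetries = conf.getInt("ipc.client.connect.max.retries", 10);
```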
So if the first datanode is bad, we will try 10 times before trying the second. In the context of
HBASE-6738, the split task is cancelled before we have even opened the file to split.
By nature, this is likely a pure HDFS issue. But maybe it can be worked around in HBase with the right setting for "ipc.client.connect.max.retries".
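If the workaround route is taken, the setting could look like this in hbase-site.xml (the value 0 here is an illustration, not a tested recommendation — it trades retry robustness for failing over to the next datanode quickly):

```
<property>
  <name>ipc.client.connect.max.retries</name>
  <!-- 0 (or 1) makes a connect failure move on to the next datanode
       instead of retrying the same dead one 10 times -->
  <value>0</value>
</property>
```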
The ideal fix (in HDFS) would be to try each datanode once, and then loop up to 10 times.
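A toy sketch of the difference between the two attempt orders (class and method names are hypothetical, not HDFS code): with per-node retries, a dead first datanode absorbs all 10 attempts before the second is ever contacted; with round-robin, the second datanode is reached on the second attempt.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryOrder {
  // Current behavior: exhaust maxRetries attempts on each node before moving on.
  static List<String> perNodeOrder(List<String> nodes, int maxRetries) {
    List<String> attempts = new ArrayList<>();
    for (String n : nodes) {
      for (int i = 0; i < maxRetries; i++) {
        attempts.add(n);
      }
    }
    return attempts;
  }

  // Proposed behavior: try each node once per pass, looping up to maxRetries passes.
  static List<String> roundRobinOrder(List<String> nodes, int maxRetries) {
    List<String> attempts = new ArrayList<>();
    for (int i = 0; i < maxRetries; i++) {
      for (String n : nodes) {
        attempts.add(n);
      }
    }
    return attempts;
  }

  public static void main(String[] args) {
    List<String> nodes = List.of("dn1", "dn2");
    // Per-node retries: dn2 is first tried only after 10 attempts on dn1.
    System.out.println(perNodeOrder(nodes, 10).indexOf("dn2"));    // prints 10
    // Round-robin: dn2 is tried on the second attempt overall.
    System.out.println(roundRobinOrder(nodes, 10).indexOf("dn2")); // prints 1
  }
}
```

If each failed connect costs a full timeout, that is the difference between one timeout and ten before reaching a live replica.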