Can you describe this better?
If we look at this in layers, we've got three layers:
Here, layer 3 knows (or guesses) that layer 1 is dead, while the layer in the middle does not know it. That's not a perfect example of encapsulation. HBase is effectively telling HDFS: 'you know, I want some blocks, but maybe this datanode is not good, I'm not sure, but please don't use it'. Kind of strange (but useful short term).
Today, when there is a global issue, HBase starts its recovery while HDFS is still ignoring the issue. This leads to a nightmare of socket exceptions all over the place, as HBase is directed to dead nodes again and again. HDFS should know what's going on before HBase does. So if HBase is set with a timeout of 30s, HDFS should have 20s or something like that.
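To illustrate the layering idea, a hedged config sketch: the inner layer (the DFS client) gives up on a bad datanode before the outer layer (the HBase RPC) times out. The property names and values below are illustrative assumptions; the exact keys depend on the HBase/HDFS versions in use.

```xml
<!-- hbase-site.xml: the outermost timeout layer (illustrative) -->
<property>
  <name>hbase.rpc.timeout</name>
  <value>30000</value> <!-- 30s -->
</property>

<!-- hdfs-site.xml: DFS client socket timeout, kept shorter so the
     HDFS layer detects a dead datanode before HBase's own timeout
     fires (illustrative) -->
<property>
  <name>dfs.client.socket-timeout</name>
  <value>20000</value> <!-- 20s -->
</property>
```

The point is only the ordering constraint: each inner layer's timeout should be strictly smaller than the layer above it.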
Whether it's ZooKeeper heartbeats or datanode heartbeats to the namenode, at a high level the mechanisms are similar.
Fully agreed. Just that if the issue comes from ZK or the ZK links, HBase and HDFS would have a similar view of the situation (maybe a wrong view, but the same view). On the other hand, there are possible improvements, not available in ZK today, but hopefully available one day when there is more code to share (I'm thinking about ZOOKEEPER-702). Also, still long term, ZK creates one TCP connection per monitored process. If multiple Hadoop processes share the same tech, it would make sense to have a shared component on each computer to lower the number of connections. I'm not aware of anything on this subject in ZK, so that's science fiction today. I've got other stuff like this in mind, but you get the idea.
So, I fully agree with your main point: today the real issue is the right timeout.
The problem is one of choosing the right timeout. Currently this is configurable in HDFS, and 10 minutes is chosen as the timeout. I suggest running some experiments with setting this to a more aggressive value. I agree that this is a very conservative time, but false positives here could result in a replication storm.
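For reference, the ~10-minute figure is not a single knob: the namenode derives it from two heartbeat settings, roughly 2 × recheck-interval + 10 × heartbeat-interval. A sketch of the arithmetic with the usual defaults (the exact property names and formula may differ across Hadoop versions):

```python
# Sketch of how the HDFS namenode derives the dead-node timeout
# from its two heartbeat settings (defaults shown; exact property
# names and formula may vary across Hadoop versions).

def dead_node_timeout_ms(heartbeat_recheck_ms=5 * 60 * 1000,
                         heartbeat_interval_ms=3 * 1000):
    # A datanode is declared dead after missing heartbeats for
    # 2 * recheck-interval + 10 * heartbeat-interval.
    return 2 * heartbeat_recheck_ms + 10 * heartbeat_interval_ms

print(dead_node_timeout_ms() / 60000.0)  # 10.5 minutes with the defaults
```

So "setting this to a more aggressive value" mostly means lowering the recheck interval, which is the dominant term.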
Agreed; even with the current setting, people have had issues in the past. 10 minutes seems to be a reasonable, real-world-validated timeout for re-replicating, and I don't think it's a good idea to make it lower. However, I think it would be good to have a middle state between fully available and definitively dead: non-responding nodes could be removed from the target list for new blocks and de-prioritized for reads.
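The proposed middle state could look like the sketch below: a node that misses heartbeats is first marked stale (skipped as a target for new blocks, de-prioritized for reads) and only much later declared dead (triggering re-replication). All names and thresholds here are illustrative assumptions, not existing HDFS behavior.

```python
# Hedged sketch of a three-state node tracker: LIVE -> STALE -> DEAD.
# Thresholds and names are illustrative, not real HDFS settings.
from enum import Enum

class NodeState(Enum):
    LIVE = 1
    STALE = 2   # avoid for new blocks, de-prioritize for reads
    DEAD = 3    # trigger re-replication

STALE_AFTER_S = 30          # aggressive, aligned with HBase-level timeouts
DEAD_AFTER_S = 10.5 * 60    # conservative, matches the current ~10 min

def node_state(last_heartbeat_s, now_s):
    silent = now_s - last_heartbeat_s
    if silent >= DEAD_AFTER_S:
        return NodeState.DEAD
    if silent >= STALE_AFTER_S:
        return NodeState.STALE
    return NodeState.LIVE

def pick_write_targets(nodes, now_s):
    # Only fully live nodes are candidates for new blocks;
    # stale nodes are skipped without being declared dead.
    return [n for n, hb in nodes.items()
            if node_state(hb, now_s) == NodeState.LIVE]
```

The appeal is that the stale threshold can be aggressive (false positives only cost a slightly worse placement choice) while the dead threshold stays conservative (false positives there cost a replication storm).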