Been thinking about this a bit tonight. It seems to me we have the following classes of errors to deal with:
- A DN has died but the NN does not yet know about it. Thus, the client fails entirely to connect to the DN. Ideally, the client shouldn't reconnect to this for quite some time.
- A DN is heavily loaded (above its max.xcievers value) and thus the client is rejected. But, we'd like to retry it reasonably often, and ideally don't want to fail a read completely, even if all replicas are in this state for a short period of time.
- A particular replica is corrupt or missing on a DN. Here, we just want to avoid reading this particular block from this DN until the block has been rereplicated from a healthy copy.
Case #3 above is actually handled implicitly with no long-term/inter-operation tracking on the client, since the client will report the bad block to the NN immediately upon discovering it. Then, on the next getBlockLocations call for the same block, it will automatically be filtered out of the LocatedBlocks result by the NN. When it's been fixed up, the new valid location will end up in the LocatedBlocks result (whether on the same DN or a different one)
Given this, I disagree with the Description of this issue - I don't think the client needs to track failures by block, just by datanode, as long as checksum failures are handled differently than connection or timeout failures.
The remaining question is how to handle both case 1 and case 2 above in a convenient manner. Here's one idea:
Whenever the client fails to read from a datanode, the timestamp of the failure is recorded in a map keyed by node. When a block is to be read, the list of locations is sorted based on ascending timestamp of last faillure - thus the nodes that have had problems least recently are retried first. Any node with last failure past some threshold in the past (eg 5 minutes) is considered to have never failed and is removed from the map. Any node that has no recorded failure info should be prioritized above any node that does have failure info.
This should be fairly simple to implement without any protocol changes, and also easy to reason about. The map would ideally be DFSClient-wide so applications that use a lot of separate InputStreams won't use a lot of extra memory, and can share their view of the DN health.
One possible improvement on the above is to use datanode heartbeat times to distinguish between case 1 and case 2. Specifically, a "relativeLastHeartbeat" field could be added to LocatedBlocks for each datanode. The DN can then use this information to remove failure info for any DN whose failures were recorded before the last heartbeat. Thus, it will retry heavily loaded nodes once per heartbeat interval, but won't retry nodes that have actually failed. The downside is that this would require a protocol change, and be harder to reason about for cases like network partitions where a DN is heartbeating fine but some set of clients can't connect to it.
Looking forward to hearing people's thoughts.