Affects Version/s: 2.0.0-alpha, 1.2.1
Fix Version/s: None
Consider the following contrived example:
During the read in Step 4, the DFSInputStream client receives "stale" block locations from the NameNode. Specifically, it receives block locations that the NameNode has already pruned/invalidated (and the DataNodes have already deleted).
The net effect of this is unnecessary churn in the DFSClient (timeouts, retries, extra RPCs, etc.). In particular:
The blacklisting of DataNodes that are, in fact, functioning properly can lead to inefficient locality of reads. Since the blacklist is cumulative across all blocks in the file, this can have a noticeable impact for large files.
A pathological case can occur when all block locations are in the blacklist. In this case, the DFSInputStream will sleep and refetch locations from the NameNode, causing unnecessary RPCs and a client-side sleep.
This pathological case can occur in the following example (for a read of file foo):
- DFSInputStream attempts to read block 1 of foo.
  - Gets locations: ( dn1(stale), dn2 )
  - Attempts read from dn1. Fails. Adds dn1 to blacklist.
- DFSInputStream attempts to read block 2 of foo.
  - Gets locations: ( dn1, dn2(stale) )
  - Attempts read from dn2 (dn1 already blacklisted). Fails. Adds dn2 to blacklist.
- All locations for block 2 are now in blacklist.
  - Clears blacklists
  - Sleeps up to 3 seconds
  - Refetches locations from the NameNode
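The sequence above can be reproduced with a small toy model. This is only a sketch of the behavior described in this report, not the real DFSInputStream code: the class `BlacklistSimulation`, its `Location` type, and the `readBlock` method are hypothetical names, and the sleep is omitted so that only the blacklist bookkeeping and the refetch count are modeled.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the client-side blacklist (hypothetical; not the real DFSInputStream). */
class BlacklistSimulation {
    /** Hypothetical location entry: a DataNode name and whether its replica was already deleted. */
    static final class Location {
        final String dataNode;
        final boolean stale;

        Location(String dataNode, boolean stale) {
            this.dataNode = dataNode;
            this.stale = stale;
        }
    }

    /** Cumulative across all blocks of the file, as described in the report. */
    final Set<String> blacklist = new HashSet<>();
    /** Number of times the client had to go back to the NameNode. */
    int refetches = 0;

    /** Returns the DataNode that served the block, or null if every location ended up blacklisted. */
    String readBlock(List<Location> locations) {
        for (Location loc : locations) {
            if (blacklist.contains(loc.dataNode)) {
                continue; // skip nodes that failed earlier, even for a different block
            }
            if (loc.stale) {
                blacklist.add(loc.dataNode); // the read fails because the replica is gone
                continue;
            }
            return loc.dataNode; // read succeeds
        }
        // Every location is blacklisted: clear the blacklist and refetch.
        // (The real client also sleeps here; the sleep is left out of this sketch.)
        blacklist.clear();
        refetches++;
        return null;
    }
}
```

Feeding it the two location lists from the example shows block 1 being served by dn2 while block 2 triggers a refetch, even though dn1 holds a valid replica of block 2.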
One solution would be for the NameNode to stop returning block locations for replicas that it has already asked DataNodes to invalidate.
A quick look at the BlockManager.chooseExcessReplicates() code path seems to indicate that the NameNode does not actually remove the pruned replica from the BlocksMap until the subsequent blockReport is received. This can leave a substantial window where the NameNode can return stale replica locations to clients.
If the NameNode were to proactively update the BlocksMap upon excess replica pruning, this situation could be avoided. If the DataNode did not actually invalidate the replica as asked, the NameNode would simply re-add the replica to the BlocksMap upon the next blockReport and go through the pruning exercise again.
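A minimal sketch of the proposed behavior, assuming a simplified block-to-replica map. `ToyBlocksMap` and all of its methods are hypothetical and do not correspond to the real BlocksMap or BlockManager APIs. The block report handling here only re-adds replicas, which is the part relevant to this proposal.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for the NameNode's block-to-replica map (hypothetical; not the real BlocksMap). */
class ToyBlocksMap {
    private final Map<Long, Set<String>> replicas = new HashMap<>();

    void addReplica(long blockId, String dataNode) {
        replicas.computeIfAbsent(blockId, id -> new HashSet<>()).add(dataNode);
    }

    /**
     * Proposed behavior: remove the excess replica from the map as soon as the
     * invalidation is requested, instead of waiting for the next block report.
     */
    void pruneExcessReplica(long blockId, String dataNode) {
        Set<String> nodes = replicas.get(blockId);
        if (nodes != null) {
            nodes.remove(dataNode);
        }
        // A real NameNode would also queue the invalidate command for the DataNode here.
    }

    /** If the DataNode still holds the replica, its block report simply re-adds it. */
    void processBlockReport(String dataNode, Set<Long> reportedBlocks) {
        for (long blockId : reportedBlocks) {
            addReplica(blockId, dataNode);
        }
    }

    /** Locations handed to clients; a pruned replica no longer appears. */
    Set<String> getLocations(long blockId) {
        return Collections.unmodifiableSet(
            replicas.getOrDefault(blockId, Collections.emptySet()));
    }
}
```

With this shape, a client asking for locations between the pruning decision and the next block report would not see the pruned DataNode, and a DataNode that ignored the invalidation would reappear after its next report and be pruned again.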