Details
- Type: Bug
- Priority: Minor
- Status: Open
- Resolution: Unresolved
- Affects Version/s: 2.0.0-alpha, 1.2.1
Description
Consider the following contrived example:
// Step 1: Create file with replication factor = 2
Path path = ...;
short replication = 2;
OutputStream os = fs.create(path, ..., replication, ...);

// Step 2: Write to file
os.write(...);

// Step 3: Reduce replication factor to 1
fs.setReplication(path, (short) 1);
// Wait for the namenode to prune the excess replicas

// Step 4: Read from file
InputStream is = fs.open(path);
is.read(...);
During the read in Step 4, the DFSInputStream client receives "stale" block locations from the NameNode. Specifically, it receives block locations that the NameNode has already pruned/invalidated (and the DataNodes have already deleted).
The net effect of this is unnecessary churn in the DFSClient (timeouts, retries, extra RPCs, etc.). In particular:
WARN hdfs.DFSClient - Failed to connect to datanode-1 for block, add to deadNodes and continue.
Blacklisting DataNodes that are, in fact, functioning properly can lead to poor read locality. Since the blacklist is cumulative across all blocks in the file, this can have a noticeable impact for large files.
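As an illustration of the locality cost (a toy model, not the actual DFSInputStream code; the node names and the pickNode helper are hypothetical), a cumulative dead-node set means a node blacklisted while reading one block is skipped for every later block, even if it holds a perfectly valid local replica:

```java
import java.util.*;

public class DeadNodeLocality {
    // Pick the first location that is not blacklisted, preferring the
    // local node when it is still eligible (simplified selection logic).
    static String pickNode(List<String> locations, Set<String> deadNodes, String localNode) {
        if (locations.contains(localNode) && !deadNodes.contains(localNode)) {
            return localNode;
        }
        for (String dn : locations) {
            if (!deadNodes.contains(dn)) {
                return dn;
            }
        }
        return null; // all locations blacklisted
    }

    public static void main(String[] args) {
        Set<String> deadNodes = new HashSet<>(); // cumulative across the whole file
        String localNode = "dn1";

        // Block 1: dn1's replica was stale; the read failed and dn1 was blacklisted.
        deadNodes.add("dn1");

        // Block 2: dn1 actually holds a valid replica, but it is already
        // blacklisted, so the client reads remotely from dn2 instead.
        String chosen = pickNode(Arrays.asList("dn1", "dn2"), deadNodes, localNode);
        System.out.println("block 2 read from: " + chosen); // remote read despite a live local replica
    }
}
```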
A pathological case can occur when all block locations are in the blacklist. In this case, the DFSInputStream will sleep and refetch locations from the NameNode, causing unnecessary RPCs and a client-side sleep:
INFO hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...
This pathological case can occur in the following example (for a read of file foo):
- DFSInputStream attempts to read block 1 of foo.
  - Gets locations: ( dn1(stale), dn2 )
  - Attempts read from dn1. Fails. Adds dn1 to blacklist.
- DFSInputStream attempts to read block 2 of foo.
  - Gets locations: ( dn1, dn2(stale) )
  - Attempts read from dn2 (dn1 already blacklisted). Fails. Adds dn2 to blacklist.
  - All locations for block 2 are now in the blacklist.
  - Clears the blacklist.
  - Sleeps up to 3 seconds.
  - Refetches locations from the NameNode.
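The walk-through above can be simulated with a toy retry loop. The structure and all names here are illustrative, not the real DFSInputStream; fetchLocations stands in for a getBlockLocations RPC whose first answer still contains an already-invalidated replica:

```java
import java.util.*;

public class StaleLocationRetry {
    static int refetches = 0;

    // Hypothetical stand-in for a getBlockLocations RPC: the first call
    // returns locations including a pruned replica (dn2), the second
    // (after the NameNode catches up) returns only live locations.
    static List<String> fetchLocations() {
        refetches++;
        return refetches == 1 ? Arrays.asList("dn1", "dn2")  // dn2 is stale
                              : Arrays.asList("dn1");        // fresh
    }

    static boolean readSucceeds(String dn) {
        return !dn.equals("dn2"); // dn2's replica was already deleted
    }

    public static void main(String[] args) {
        Set<String> deadNodes = new HashSet<>();
        deadNodes.add("dn1"); // blacklisted while reading block 1 (stale there)

        List<String> locations = fetchLocations();
        while (true) {
            String candidate = null;
            for (String dn : locations) {
                if (!deadNodes.contains(dn)) {
                    candidate = dn;
                    break;
                }
            }
            if (candidate == null) {
                // All locations blacklisted: clear the list, back off
                // (the real client sleeps up to 3 seconds here), refetch.
                deadNodes.clear();
                locations = fetchLocations();
                continue;
            }
            if (readSucceeds(candidate)) {
                System.out.println("read block 2 from " + candidate
                        + " after " + refetches + " location fetch(es)");
                break;
            }
            deadNodes.add(candidate); // failed read: blacklist and try the next node
        }
    }
}
```

The read eventually succeeds, but only after an extra location fetch and a client-side backoff, which is exactly the churn described above.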
One solution would be to change the NameNode so that it does not return block locations to clients for replicas it has already asked DataNodes to invalidate.
A quick look at the BlockManager.chooseExcessReplicates() code path seems to indicate that the NameNode does not actually remove the pruned replica from the BlocksMap until the subsequent blockReport is received. This can leave a substantial window where the NameNode can return stale replica locations to clients.
If the NameNode were to proactively update the BlocksMap upon excess replica pruning, this situation could be avoided. If the DataNode did not actually invalidate the replica as asked, the NameNode would simply re-add the replica to the BlocksMap upon next blockReport and go through the pruning exercise again.
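A minimal sketch of this proposal, using a plain map as a stand-in for the BlocksMap (all names here are hypothetical, not the actual BlockManager API): drop the location at the moment the invalidation is scheduled, and let the next block report re-add it if the DataNode never actually deleted the replica:

```java
import java.util.*;

public class ProactivePrune {
    // Toy stand-in for the BlocksMap: block id -> replica locations.
    static Map<String, Set<String>> blocksMap = new HashMap<>();

    // Proposed behavior: remove the location as soon as the NameNode
    // asks the DataNode to invalidate the excess replica, instead of
    // waiting for the next block report. (Real code would also queue
    // the invalidation work for the DataNode.)
    static void chooseExcessReplica(String block, String dn) {
        blocksMap.get(block).remove(dn);
    }

    // If the DataNode still reports the replica (it never deleted it),
    // re-add the location; the NameNode would then simply schedule the
    // pruning again on the next pass.
    static void onBlockReport(String block, String dn) {
        blocksMap.get(block).add(dn);
    }

    public static void main(String[] args) {
        blocksMap.put("blk_1", new HashSet<>(Arrays.asList("dn1", "dn2")));

        chooseExcessReplica("blk_1", "dn2");
        // Clients asking for blk_1 now see only live locations.
        System.out.println(blocksMap.get("blk_1"));

        // DataNode did not delete the replica after all: the next
        // block report restores it, and pruning runs again.
        onBlockReport("blk_1", "dn2");
        System.out.println(blocksMap.get("blk_1"));
    }
}
```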