Hadoop HDFS
  HDFS-5380

NameNode returns stale block locations to clients during excess replica pruning


    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha, 1.2.1
    • Fix Version/s: None
    • Component/s: namenode
    • Labels:


      Consider the following contrived example:

      // Step 1: Create file with replication factor = 2
      Path path = ...;
      short replication = 2;
      OutputStream os = fs.create(path, ..., replication, ...);
      // Step 2: Write to file
      // Step 3: Reduce replication factor to 1
      fs.setReplication(path, (short) 1);  // setReplication takes a short, so the literal needs a cast
      // Wait for the NameNode to prune excess replicas
      // Step 4: Read from file
      InputStream is = fs.open(path);

      During the read in Step 4, the DFSInputStream client receives "stale" block locations from the NameNode. Specifically, it receives block locations that the NameNode has already pruned/invalidated (and the DataNodes have already deleted).

      The net effect of this is unnecessary churn in the DFSClient (timeouts, retries, extra RPCs, etc.). In particular:

      WARN  hdfs.DFSClient - Failed to connect to datanode-1 for block, add to deadNodes and continue.

      Blacklisting DataNodes that are, in fact, functioning properly can degrade read locality. Because the blacklist is cumulative across all blocks in the file, the impact can be noticeable for large files.

      A pathological case can occur when all block locations are in the blacklist. In this case, the DFSInputStream will sleep and refetch locations from the NameNode, causing unnecessary RPCs and a client-side sleep:

      INFO  hdfs.DFSClient - Could not obtain blk_1073741826_1002 from any node: java.io.IOException: No live nodes contain current block. Will get new block locations from namenode and retry...

      This pathological case can occur in the following example (for a read of file foo):

      1. DFSInputStream attempts to read block 1 of foo.
      2. Gets locations: ( dn1(stale), dn2 )
      3. Attempts read from dn1. Fails. Adds dn1 to blacklist.
      4. DFSInputStream attempts to read block 2 of foo.
      5. Gets locations: ( dn1, dn2(stale) )
      6. Attempts read from dn2 (dn1 already blacklisted). Fails. Adds dn2 to blacklist.
      7. All locations for block 2 are now in blacklist.
      8. Clears blacklists
      9. Sleeps up to 3 seconds
      10. Refetches locations from the NameNode
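
      The steps above can be reproduced with a small stand-alone model. This is a minimal sketch, not the real DFSInputStream: the class DeadNodeSketch, the Location type, and the readAll method are hypothetical names introduced for illustration, and a "stale" location simply stands for a replica the DataNode has already deleted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of a client-side dead-node list shared across blocks (hypothetical, not HDFS code). */
public class DeadNodeSketch {

    /** A replica location as returned by the NameNode; stale means the replica is already gone. */
    static final class Location {
        final String node;
        final boolean stale;

        Location(String node, boolean stale) {
            this.node = node;
            this.stale = stale;
        }
    }

    /** Reads the blocks in order and returns a log of what happened. */
    static List<String> readAll(Map<String, List<Location>> blocks) {
        Set<String> deadNodes = new HashSet<>(); // cumulative across all blocks of the file
        List<String> log = new ArrayList<>();
        for (Map.Entry<String, List<Location>> entry : blocks.entrySet()) {
            String block = entry.getKey();
            boolean done = false;
            for (Location loc : entry.getValue()) {
                if (deadNodes.contains(loc.node)) {
                    continue; // skipped even if this node holds a valid replica of this block
                }
                if (loc.stale) {
                    deadNodes.add(loc.node);
                    log.add(block + ": failed on " + loc.node);
                } else {
                    log.add(block + ": read from " + loc.node);
                    done = true;
                    break;
                }
            }
            if (!done) {
                // Every location is blacklisted: clear the list and go back to the NameNode.
                deadNodes.clear();
                log.add(block + ": no live nodes, refetch");
            }
        }
        return log;
    }

    public static void main(String[] args) {
        Map<String, List<Location>> blocks = new LinkedHashMap<>();
        blocks.put("block1", Arrays.asList(new Location("dn1", true), new Location("dn2", false)));
        blocks.put("block2", Arrays.asList(new Location("dn1", false), new Location("dn2", true)));
        for (String line : readAll(blocks)) {
            System.out.println(line);
        }
    }
}
```

      In this model, dn1 holds a valid replica of block 2 but is never tried, because it was blacklisted while reading block 1. The sleep from step 9 is omitted.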

      A solution would be to change the NameNode to not return stale block locations to clients for replicas that it knows it has asked DataNodes to invalidate.

      A quick look at the BlockManager.chooseExcessReplicates() code path seems to indicate that the NameNode does not actually remove the pruned replica from the BlocksMap until the subsequent blockReport is received. This can leave a substantial window where the NameNode can return stale replica locations to clients.

      If the NameNode were to proactively update the BlocksMap upon excess replica pruning, this situation could be avoided. If the DataNode did not actually invalidate the replica as asked, the NameNode would simply re-add the replica to the BlocksMap upon the next blockReport and go through the pruning exercise again.
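
      The proposed behavior can be illustrated with a toy map from blocks to locations. This is a sketch under stated assumptions, not the actual BlockManager or BlocksMap code: ProactivePruneSketch and all of its methods are hypothetical, and the block report handling is simplified to only re-add reported replicas.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for the NameNode's block-to-locations map (hypothetical, not HDFS code). */
public class ProactivePruneSketch {
    private final Map<String, Set<String>> locations = new HashMap<>();

    /** Records that the given node holds a replica of the block. */
    void addReplica(String block, String node) {
        locations.computeIfAbsent(block, b -> new LinkedHashSet<>()).add(node);
    }

    /** Proposed behavior: drop the location as soon as the invalidation is scheduled. */
    void scheduleInvalidation(String block, String node) {
        Set<String> nodes = locations.get(block);
        if (nodes != null) {
            nodes.remove(node);
        }
        // A real NameNode would also queue the invalidate command for the DataNode here.
    }

    /** Simplified block report: any replica the node still holds is re-added. */
    void processBlockReport(String node, Set<String> reportedBlocks) {
        for (String block : reportedBlocks) {
            addReplica(block, node);
        }
    }

    /** Returns a copy of the locations that would be handed to a client. */
    Set<String> getLocations(String block) {
        return new LinkedHashSet<>(locations.getOrDefault(block, Collections.emptySet()));
    }

    public static void main(String[] args) {
        ProactivePruneSketch map = new ProactivePruneSketch();
        map.addReplica("blk_1", "dn1");
        map.addReplica("blk_1", "dn2");

        map.scheduleInvalidation("blk_1", "dn1");
        System.out.println(map.getLocations("blk_1")); // dn1 is no longer offered to clients

        // If dn1 never deleted the replica, its next block report restores the location.
        map.processBlockReport("dn1", Collections.singleton("blk_1"));
        System.out.println(map.getLocations("blk_1"));
    }
}
```

      With this ordering, a client that asks for locations between the invalidation request and the next block report never sees the pruned replica.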


        Eric Sirianni added a comment -

        JUnit test that demonstrates this issue using MiniDFSCluster



          • Assignee:
            Eric Sirianni
          • Votes:
            1
          • Watchers:
            3


            • Created: