HDFS-14626

Decommission all nodes hosting last block of open file succeeds unexpectedly


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      I have been investigating scenarios that cause decommission to hang, especially one long-standing issue: an open block on a host that is being decommissioned can cause the process to never complete.

      Checking the history, there seems to have been at least one change in HDFS-5579 which greatly improved the situation, but from reading comments and support cases, there still seem to be some scenarios where open blocks on a DN host cause the decommission to get stuck.

      No matter what I try, I have not been able to reproduce this, but I think I have uncovered another issue that may partly explain why.

      If I do the following, the nodes will decommission without any issues:

      1. Create a file and write to it so it crosses a block boundary. Then there is one complete block and one under-construction block. Keep the file open, and write a few bytes periodically.

      2. Now note the nodes on which the UC block is currently being written, and decommission them all.

      3. The decommission should succeed.

      4. Now attempt to close the open file, and it will fail to close with an error like the one below, probably because decommissioned nodes are not allowed to send IBRs:

      java.io.IOException: Unable to close file because the last block BP-646926902-192.168.0.20-1562099323291:blk_1073741827_1003 does not have enough number of replicas.
          at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:968)
          at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:911)
          at org.apache.hadoop.hdfs.DFSOutputStream.closeImpl(DFSOutputStream.java:894)
          at org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:849)
          at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:72)
          at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:101)

      Interestingly, if you recommission the nodes without restarting them before closing the file, it will close OK, and writes to it can continue even once decommission has completed.

      I don't think this is expected, i.e. decommission should not be able to complete on all nodes hosting the last UC block of a file?
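      For reference, below is a rough client-side sketch of the above steps against a running cluster. The path, the 1 MB block size, and the sleep placeholder are arbitrary choices (not taken from the attached patch), and the decommission itself is done out of band, e.g. via the exclude file and "hdfs dfsadmin -refreshNodes".

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class DecomReproSketch {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          Path p = new Path("/tmp/decom-repro");   // arbitrary test path
          long blockSize = 1024 * 1024;            // small blocks keep the repro quick

          // Step 1: write past the block boundary so there is one complete block
          // and one under-construction block, and keep the stream open.
          FSDataOutputStream out = fs.create(p, true, 4096, (short) 3, blockSize);
          byte[] chunk = new byte[4096];
          long written = 0;
          while (written <= blockSize) {
            out.write(chunk);
            written += chunk.length;
          }
          out.hflush();

          // Step 2: note the hosts of the last block; after the hflush the last
          // entry should correspond to the block still being written.
          BlockLocation[] locs = fs.getFileBlockLocations(p, 0, written);
          for (String host : locs[locs.length - 1].getHosts()) {
            System.out.println("Last block replica on: " + host);
          }

          // Steps 3-4: decommission those hosts and wait for it to complete
          // (placeholder sleep), then try to close. In the scenario described
          // above, the close fails with "does not have enough number of replicas".
          Thread.sleep(10 * 60 * 1000);
          out.close();
        }
      }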

      From what I have figured out, I don't think UC blocks are considered in the DatanodeAdminManager at all. This is because the original list of blocks it cares about is taken from the Datanode block iterator, which reads them from the DatanodeStorageInfo objects attached to the datanode instance. I believe UC blocks don't make it into the DatanodeStorageInfo until after they have been completed and an IBR sent, so the decommission logic never considers them.
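      To make the suspected mechanism concrete, here is a deliberately tiny toy model (plain Java, not Hadoop code; all names are made up). The "reportedBlocks" set stands in for the block list held on the DatanodeStorageInfo, which is only populated from block reports / IBRs, so a block that has never been reported never enters the scan and cannot hold up decommission.

      import java.util.HashSet;
      import java.util.Set;

      public class ToyAdminScan {

        // Toy stand-in for a datanode: only blocks that have been reported appear here.
        static class ToyDatanode {
          final Set<Long> reportedBlocks = new HashSet<>();
        }

        // Returns true if the node can finish decommissioning: the scan only visits
        // blocks it can see on the node's storages.
        static boolean canFinishDecommission(ToyDatanode dn) {
          for (long blockId : dn.reportedBlocks) {
            // ... check that blockId has enough live replicas elsewhere (elided) ...
          }
          // A UC block that was never reported is not in the set, so nothing here
          // stops the node from completing.
          return true;
        }

        public static void main(String[] args) {
          ToyDatanode dn = new ToyDatanode();
          dn.reportedBlocks.add(1073741826L); // the completed first block
          long ucBlock = 1073741827L;         // the open last block, never reported
          System.out.println("UC block visible to scan? "
              + dn.reportedBlocks.contains(ucBlock));   // false
          System.out.println("Decommission allowed to finish? "
              + canFinishDecommission(dn));             // true
        }
      }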

      What troubles me about this explanation is this: if the decommission logic never checks for open blocks, how did they previously cause decommission to get stuck? I suspect I am missing something.

      I will attach a patch with a test case that demonstrates this issue. This reproduces on trunk and I also tested on CDH 5.8.1, which is based on the 2.6 branch, but with a lot of backports.

      Attachments

        1. test-to-reproduce.patch
          3 kB
          Stephen O'Donnell


            People

              Assignee: Unassigned
              Reporter: Stephen O'Donnell (sodonnell)
              Votes: 1
              Watchers: 9
