  1. Hadoop HDFS
  2. HDFS-9950

TestDecommissioningStatus fails intermittently in trunk

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: test
    • Labels:
      None

      Description

      I have often seen the test case TestDecommissioningStatus fail intermittently. The test failure report always shows this error:

      testDecommissionStatus(org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus)  Time elapsed: 0.462 sec  <<< FAILURE!
      java.lang.AssertionError: Unexpected num under-replicated blocks expected:<3> but was:<4>
      	at org.junit.Assert.fail(Assert.java:88)
      	at org.junit.Assert.failNotEquals(Assert.java:743)
      	at org.junit.Assert.assertEquals(Assert.java:118)
      	at org.junit.Assert.assertEquals(Assert.java:555)
      	at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.checkDecommissionStatus(TestDecommissioningStatus.java:196)
      	at org.apache.hadoop.hdfs.server.namenode.TestDecommissioningStatus.testDecommissionStatus(TestDecommissioningStatus.java:291)
      

      The cause is that the under-replicated block count checked in the method checkDecommissionStatus, called from TestDecommissioningStatus#testDecommissionStatus, does not match the expected value.

      In this test case, each datanode should hold 4 blocks (2 for decommission.dat, 2 for decommission.dat1). The expected count of 3 on the first node arises because the last block of an under-construction (uc) blockCollection is not scheduled for replication when its live replica count is at least the blockManager minReplication (1 in this case). Before the second datanode is decommissioned, that node still holds one live replica of the uc blockCollection's last block.

      So in the failing case, the first node's under-replicated count rising to 4 indicates that the live replica count of the uc blockCollection's last block was already 0 before the second datanode was decommissioned. I think there are two possibilities that could lead to this:

      • The second datanode was already decommissioned before the first one.
      • Creating the file decommission.dat1 failed, so the second datanode never received this block.
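
      The counting argument above can be modeled with a small sketch. This is a simplified illustration, not HDFS code: the class UnderReplicatedModel, its Block type, and the method names are hypothetical, and it assumes every block on the decommissioning node would otherwise count as under-replicated.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the counting rule described in this issue; not actual HDFS code.
final class UnderReplicatedModel {

    /** Hypothetical stand-in for one block stored on the decommissioning node. */
    static final class Block {
        final boolean lastBlockOfOpenFile; // last block of an under-construction file
        final int liveReplicas;            // live replicas on other datanodes

        Block(boolean lastBlockOfOpenFile, int liveReplicas) {
            this.lastBlockOfOpenFile = lastBlockOfOpenFile;
            this.liveReplicas = liveReplicas;
        }
    }

    /** Counts the blocks that would be reported as under-replicated. */
    static int countUnderReplicated(List<Block> blocks, int minReplication) {
        int count = 0;
        for (Block b : blocks) {
            // The last block of an open file is skipped when enough live replicas remain.
            boolean skipped = b.lastBlockOfOpenFile && b.liveReplicas >= minReplication;
            if (!skipped) {
                count++;
            }
        }
        return count;
    }

    /** Builds the 4 blocks of one node, with minReplication assumed to be 1. */
    static int countForNode(int liveReplicasOfOpenLastBlock) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block(false, 0));
        blocks.add(new Block(false, 0));
        blocks.add(new Block(false, 0));
        blocks.add(new Block(true, liveReplicasOfOpenLastBlock));
        return countUnderReplicated(blocks, 1);
    }
}
```

      In this model, countForNode(1) corresponds to the expected result of 3, and countForNode(0) corresponds to the observed failure value of 4.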

      I read the code, and it already checks the decommission-in-progress nodes here:

      if (iteration == 0) {
        assertEquals(decommissioningNodes.size(), 1);
        DatanodeDescriptor decommNode = decommissioningNodes.get(0);
        checkDecommissionStatus(decommNode, 3, 0, 1);
        checkDFSAdminDecommissionStatus(decommissioningNodes.subList(0, 1),
            fileSys, admin);
      }
      

      So the second possibility seems the more likely cause. In addition, the test does not check the block count after it finishes creating the file. We could add such a check and retry the operation if the block count is not as expected.
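
      A minimal sketch of the proposed check-and-retry idea follows. It is only an illustration and does not reproduce the attached patch: the class BlockCountRetry, the method createUntilExpected, and the createFileAndCountBlocks callback are hypothetical names. In the real test, the callback would create the file and return the number of blocks found for it.

```java
import java.util.function.IntSupplier;

// Hypothetical helper illustrating "check the block count and retry"; not HDFS code.
final class BlockCountRetry {

    /**
     * Runs the create-and-count action until it reports the expected block
     * count, or fails after maxAttempts tries.
     *
     * @return the number of attempts that were needed
     */
    static int createUntilExpected(IntSupplier createFileAndCountBlocks,
                                   int expectedBlocks, int maxAttempts) {
        int actual = -1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            actual = createFileAndCountBlocks.getAsInt();
            if (actual == expectedBlocks) {
                return attempt;
            }
        }
        throw new IllegalStateException("Expected " + expectedBlocks
            + " blocks but last attempt produced " + actual);
    }
}
```

      Bounding the number of attempts keeps the test from hanging if file creation fails consistently, and the exception then points at file creation rather than at the later under-replicated count assertion.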

        Attachments

        1. HDFS-9950.001.patch
          3 kB
          Yiqun Lin

              People

              • Assignee:
                linyiqun Yiqun Lin
                Reporter:
                linyiqun Yiqun Lin