  1. Hadoop HDFS
  2. HDFS-11353

Improve the unit tests relevant to DataNode volume failure testing


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: 3.0.0-alpha4
    • Component/s: None
    • Labels: None

    Description

      Currently, many tests whose names start with TestDataNodeVolumeFailure* frequently time out or fail. I found one such failure in a recent Jenkins build. The stack trace:

      org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
      java.util.concurrent.TimeoutException: Timed out waiting for DN to die
      	at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
      	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)
      

      The related code:

          /*
           * Now fail the 2nd volume on the 3rd datanode. All its volumes
           * are now failed and so it should report two volume failures
           * and that it's no longer up. Only wait for two replicas since
           * we'll never get a third.
           */
          DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
          Path file3 = new Path("/test3");
          DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
          DFSTestUtil.waitReplication(fs, file3, (short)2);
      
          // The DN should consider itself dead
          DFSTestUtil.waitForDatanodeDeath(dns.get(2));
      

      Here the code waits for the datanode to fail all of its volumes and then become dead, but the wait timed out. It would be better to first check that all the volumes have failed, and only then wait for the datanode to die.

      In addition, we can use the method checkDiskErrorSync to do the disk error check instead of creating files. In this JIRA, I would like to extract this logic and define it in DataNodeTestUtils, so that we can reuse the method for datanode volume failure testing in the future.
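
      The ordering proposed above can be illustrated with a small, self-contained sketch. It is not the patch itself and it does not use Hadoop classes: NodeView, FakeNode, waitFor and waitForAllVolumesFailedThenDeath are hypothetical names invented for this illustration, and only the name checkDiskErrorSync is borrowed from the description. The idea is to trigger the disk check directly, wait until every volume is reported as failed, and only then wait for the datanode to die, so that a timeout says which of the two conditions was not met.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Illustration only: none of these types are Hadoop APIs.
class VolumeFailureWaitSketch {

  /** Hypothetical view of the datanode state that a test would observe. */
  interface NodeView {
    void checkDiskErrorSync(); // run the disk check in the calling thread
    int failedVolumes();
    int totalVolumes();
    boolean isAlive();
  }

  /** Polls the condition until it holds or the timeout expires. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs,
      String what) throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("Timed out waiting for " + what);
      }
      Thread.sleep(intervalMs);
    }
  }

  /** Waits in two steps: all volumes failed first, datanode death second. */
  static void waitForAllVolumesFailedThenDeath(NodeView dn, long timeoutMs)
      throws TimeoutException, InterruptedException {
    // Trigger the disk check directly instead of creating a file.
    dn.checkDiskErrorSync();
    waitFor(() -> dn.failedVolumes() == dn.totalVolumes(), 10, timeoutMs,
        "all volumes to fail");
    waitFor(() -> !dn.isAlive(), 10, timeoutMs, "DN to die");
  }

  /** Toy in-memory node: failures become visible only after a disk check. */
  static final class FakeNode implements NodeView {
    private final int total;
    private int injected; // volumes broken but not yet detected
    private int failed;   // volumes detected as failed

    FakeNode(int total) {
      this.total = total;
    }

    void injectFailure() {
      injected = Math.min(total, injected + 1);
    }

    @Override
    public void checkDiskErrorSync() {
      failed = injected;
    }

    @Override
    public int failedVolumes() {
      return failed;
    }

    @Override
    public int totalVolumes() {
      return total;
    }

    @Override
    public boolean isAlive() {
      // In this toy model the node is dead once every volume has failed.
      return failed < total;
    }
  }

  public static void main(String[] args) throws Exception {
    FakeNode dn = new FakeNode(2);
    dn.injectFailure();
    dn.injectFailure();
    waitForAllVolumesFailedThenDeath(dn, 1000);
    System.out.println("failed volumes: " + dn.failedVolumes()
        + ", alive: " + dn.isAlive());
  }
}
```

      With only one of the two volumes failed, the first wait in this sketch times out with the message "Timed out waiting for all volumes to fail", which points at the volume check rather than at the datanode shutdown.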

      Attachments

        1. HDFS-11353.001.patch
          13 kB
          Yiqun Lin
        2. HDFS-11353.002.patch
          10 kB
          Yiqun Lin
        3. HDFS-11353.003.patch
          10 kB
          Yiqun Lin
        4. HDFS-11353.004.patch
          11 kB
          Yiqun Lin
        5. HDFS-11353.005.patch
          13 kB
          Yiqun Lin
        6. HDFS-11353.006.patch
          14 kB
          Yiqun Lin


            People

              Assignee: Yiqun Lin (linyiqun)
              Reporter: Yiqun Lin (linyiqun)