[HDFS-11353] Improve the unit tests relevant to DataNode volume failure testing - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha2
Fix Version/s: 3.0.0-alpha4
Component/s: None
Labels: None
Target Version/s: 3.0.0-alpha4
Hadoop Flags: Reviewed

Description

Currently, many tests whose names start with TestDataNodeVolumeFailure* frequently time out or fail. I found one such failure in a recent Jenkins build. The stack trace:

org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures
java.util.concurrent.TimeoutException: Timed out waiting for DN to die
	at org.apache.hadoop.hdfs.DFSTestUtil.waitForDatanodeDeath(DFSTestUtil.java:702)
	at org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureReporting.testSuccessiveVolumeFailures(TestDataNodeVolumeFailureReporting.java:208)

The related code:

    /*
     * Now fail the 2nd volume on the 3rd datanode. All its volumes
     * are now failed and so it should report two volume failures
     * and that it's no longer up. Only wait for two replicas since
     * we'll never get a third.
     */
    DataNodeTestUtils.injectDataDirFailure(dn3Vol2);
    Path file3 = new Path("/test3");
    DFSTestUtil.createFile(fs, file3, 1024, (short)3, 1L);
    DFSTestUtil.waitReplication(fs, file3, (short)2);

    // The DN should consider itself dead
    DFSTestUtil.waitForDatanodeDeath(dns.get(2));

Here the test expects the datanode to fail all of its volumes and then become dead, but the wait timed out. It would be better to first verify that all the volumes have failed, and only then wait for the datanode to die.
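The ordering proposed here could be sketched roughly as follows. This is a self-contained illustration and not the patch itself: `TwoPhaseWait`, `waitFor`, and `FakeDataNode` are hypothetical stand-ins, and a real test would query the actual DataNode rather than a fake.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: wait for the volume failures first, then for DN death. */
class TwoPhaseWait {

  /** Polls the condition until it holds or the timeout elapses. */
  static void waitFor(BooleanSupplier check, long intervalMs, long timeoutMs,
      String what) throws TimeoutException, InterruptedException {
    long start = System.nanoTime();
    long timeoutNanos = timeoutMs * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - start >= timeoutNanos) {
        throw new TimeoutException("Timed out waiting for " + what);
      }
      Thread.sleep(intervalMs);
    }
  }

  /** Stand-in for a DataNode (an assumption, not the Hadoop class). */
  static class FakeDataNode {
    volatile int failedVolumes;
    final int totalVolumes;

    FakeDataNode(int totalVolumes) {
      this.totalVolumes = totalVolumes;
    }

    boolean isAlive() {
      return failedVolumes < totalVolumes;
    }
  }

  static void waitForAllVolumesFailedThenDeath(FakeDataNode dn, long timeoutMs)
      throws TimeoutException, InterruptedException {
    // Phase 1: a timeout here means the volume failures were never detected.
    waitFor(() -> dn.failedVolumes == dn.totalVolumes, 10, timeoutMs,
        "all volumes to fail");
    // Phase 2: only now is it meaningful to wait for the DN to shut down.
    waitFor(() -> !dn.isAlive(), 10, timeoutMs, "DN to die");
  }
}
```

With the wait split this way, a timeout message would indicate whether the volume failure was never detected or whether the datanode failed to shut down afterwards.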

In addition, we can use the method checkDiskErrorSync to perform the disk error check instead of creating files. In this JIRA, I would like to extract this logic and define it in DataNodeTestUtils, so that we can reuse the method for datanode volume failure testing in the future.
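A minimal sketch of what such a reusable helper might look like is given below. The names `VolumeFailureTestHelper`, `DiskCheckable`, and `checkAndWaitForFailedVolumes` are hypothetical and chosen only for illustration. The interface merely assumes that the datanode exposes a synchronous disk check and a failed-volume count; the signatures in the actual patch may differ.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical helper: trigger a disk check directly instead of creating files. */
class VolumeFailureTestHelper {

  /** Assumed view of the DataNode operations the helper needs. */
  interface DiskCheckable {
    /** Runs the disk error check synchronously. */
    void checkDiskErrorSync();

    /** Returns how many volumes are currently marked as failed. */
    int getNumFailedVolumes();
  }

  /**
   * Repeatedly runs the disk check until the expected number of failed
   * volumes is reported, or throws TimeoutException once the timeout elapses.
   */
  static void checkAndWaitForFailedVolumes(DiskCheckable dn, int expected,
      long intervalMs, long timeoutMs)
      throws TimeoutException, InterruptedException {
    long start = System.nanoTime();
    long timeoutNanos = timeoutMs * 1_000_000L;
    while (true) {
      dn.checkDiskErrorSync();
      int failed = dn.getNumFailedVolumes();
      if (failed == expected) {
        return;
      }
      if (System.nanoTime() - start >= timeoutNanos) {
        throw new TimeoutException("Expected " + expected
            + " failed volumes but saw " + failed);
      }
      Thread.sleep(intervalMs);
    }
  }
}
```

A test could call the helper right after injecting a data directory failure, so that detection no longer depends on a file write happening to touch the failed volume.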

Attachments

HDFS-11353.006.patch | 02/Feb/17 01:37 | 14 kB | Yiqun Lin
HDFS-11353.005.patch | 28/Jan/17 03:12 | 13 kB | Yiqun Lin
HDFS-11353.004.patch | 26/Jan/17 03:18 | 11 kB | Yiqun Lin
HDFS-11353.003.patch | 24/Jan/17 02:20 | 10 kB | Yiqun Lin
HDFS-11353.002.patch | 23/Jan/17 13:53 | 10 kB | Yiqun Lin
HDFS-11353.001.patch | 20/Jan/17 12:49 | 13 kB | Yiqun Lin

Issue Links

is duplicated by: HDFS-11372 Increase test timeouts that are too aggressive. (Resolved)

People

Assignee: Yiqun Lin
Reporter: Yiqun Lin
Votes: 0
Watchers: 5

Dates

Created: 20/Jan/17 12:36
Updated: 08/Feb/17 11:03
Resolved: 02/Feb/17 11:42