Hadoop HDFS / HDFS-1855

TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways

    Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.22.0
    • Fix Version/s: 0.22.0, 0.23.0
    • Component/s: test
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      The second part of test case TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy(), "corrupt replica recovery for two corrupt replicas", always fails: about half the time with a checksum error during block replication, and the rest of the time by timing out while waiting for the second corrupt replica to be deleted.

        Activity

        Matt Foley added a comment -

        In method blockCorruptionRecoveryPolicy(), 5 nodes are created, 3 with replicas of a certain block. Two of those replicas, on the nodes at index [0] and [1], are deliberately corrupted. The test then attempts to restart those two nodes so the corruption will be detected.

        The loop that is intended to restart both datanodes starts with [0]. But when it restarts [0], it is removed from the MiniCluster's arraylist and re-added to the end. As a result, [1] moves to [0]. But the loop then restarts the new [1], which was the former [2], which doesn't contain a corrupt replica. As a result, the corrupt replica in the former [1] never gets detected.

        In resolving the corruption, one of two errors can happen, each with about 50% probability. Since the namenode thinks it still has two good replicas, it may pick the undetected corrupt replica as the source for re-replication. That will cause a checksum error at the receiving node.

        Alternatively, it may pick the one valid replica as the source, and replicate it, and delete the bad replica from the original [0]. However, since it doesn't know that the replica on the former [1] is corrupt, it never issues the delete request. This causes the test case to time out on the wait for corrupt replica deletion.

        This problem is resolved by looping from high [1] to low [0], as is done in certain MiniDFSCluster methods.
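
        The index shift described above can be reproduced with a plain list. The following is a minimal sketch, not the test's or MiniDFSCluster's actual code: the class name RestartOrderSketch, the node names dn0..dn4, and the restart() helper are hypothetical, and a restart is modeled only as "remove from the list and re-add at the end", as the comment describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the cluster's datanode list; not MiniDFSCluster code.
class RestartOrderSketch {

    // Five nodes; assume dn0 and dn1 hold the corrupted replicas.
    static List<String> newCluster() {
        return new ArrayList<>(Arrays.asList("dn0", "dn1", "dn2", "dn3", "dn4"));
    }

    // Models a restart as removal from the list followed by re-adding at
    // the end, so every later entry shifts down by one index.
    static String restart(List<String> nodes, int index) {
        String node = nodes.remove(index);
        nodes.add(node);
        return node;
    }

    // Restarts indices 0..count-1 from low to high (the failing order).
    static List<String> restartAscending(int count) {
        List<String> nodes = newCluster();
        List<String> restarted = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    // Restarts indices count-1..0 from high to low (the order used by the fix).
    static List<String> restartDescending(int count) {
        List<String> nodes = newCluster();
        List<String> restarted = new ArrayList<>();
        for (int i = count - 1; i >= 0; i--) {
            restarted.add(restart(nodes, i));
        }
        return restarted;
    }

    public static void main(String[] args) {
        System.out.println("ascending:  " + restartAscending(2));  // [dn0, dn2]
        System.out.println("descending: " + restartDescending(2)); // [dn1, dn0]
    }
}
```

        In the ascending case dn1 is never restarted, which matches the undetected corrupt replica described above. Iterating from the high index down works because the re-added node lands past every index that is still to be visited.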

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12476970/TestDatanodeBlockScanner_bug_v1.patch
        against trunk revision 1095789.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed these core unit tests:
        org.apache.hadoop.hdfs.TestFileConcurrentReader

        +1 contrib tests. The patch passed contrib unit tests.

        +1 system test framework. The patch passed system test framework compile.

        Test results: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/396//testReport/
        Findbugs warnings: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/396//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/396//console

        This message is automatically generated.

        Eli Collins added a comment -

        +1

        The test failure is unrelated (HDFS-1401).

        Eli Collins added a comment -

        I've committed this to trunk and branch 22. Thanks Matt!

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk-Commit #600 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/600/)
        HDFS-1855. TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways. Contributed by Matt Foley

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-22-branch #35 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-22-branch/35/)
        HDFS-1855. svn merge -c 1095830 from trunk

        Hudson added a comment -

        Integrated in Hadoop-Hdfs-trunk #644 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/644/)
        HDFS-1855. TestDatanodeBlockScanner.testBlockCorruptionRecoveryPolicy() part 2 fails in two different ways. Contributed by Matt Foley


          People

          • Assignee:
            Matt Foley
            Reporter:
            Matt Foley
          • Votes:
            0
            Watchers:
            0
