Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 0.23.0
    • 0.23.0
    • namenode
    • None
    • Reviewed

    Description

      In server.namenode.TestBlocksWithNotEnoughRacks.testSufficientlyReplicatedBlocksWithNotEnoughRacks
      assert fails at curReplicas == REPLICATION_FACTOR, but it seems that it should go higher initially, and if the test doesn't wait for it to go back down, it will fail false positive.

      Attachments

        1. TestBlocksWithNotEnoughRacks_v2.patch
          0.7 kB
          Matthew Foley
        2. TestBlocksWithNotEnoughRacks.java.patch
          5 kB
          Matthew Foley

        Issue Links

          Activity

            mattf Matthew Foley added a comment -

            At line 79, it waits
            while ( (numRacks < 2) || (curReplicas < REPLICATION_FACTOR) || (neededReplicationSize > 0))

            But at line 95 it asserts
            assertTrue(curReplicas == REPLICATION_FACTOR)

            I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently. Changed the "while" to wait for the desired equality, i.e.,
            while (curReplicas != REPLICATION_FACTOR).

            Also changed the wait from infinite to a 20sec timeout with useful status output on failure.

            Similar code in the other test case under TestBlocksWithNotEnoughRacks does not have the same problem, but still changed the wait from infinite to bounded.

            Finally, added additional log messages useful for debugging if future problems.

            mattf Matthew Foley added a comment - At line 79, it waits while ( (numRacks < 2) || (curReplicas < REPLICATION_FACTOR) || (neededReplicationSize > 0)) But at line 95 it asserts assertTrue(curReplicas == REPLICATION_FACTOR) I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently. Changed the "while" to wait for the desired equality, i.e., while (curReplicas != REPLICATION_FACTOR). Also changed the wait from infinite to a 20sec timeout with useful status output on failure. Similar code in the other test case under TestBlocksWithNotEnoughRacks does not have the same problem, but still changed the wait from infinite to bounded. Finally, added additional log messages useful for debugging if future problems.
            mattf Matthew Foley added a comment -

            As part of HDFS-1295, this patch caused TestBlocksWithNotEnoughRacks to pass in build PreCommit-HDFS-Build/346, whereas it had failed before.

            Posted here for information, but I'm going to subordinate this bug to HDFS-1295 (which it was blocking) and submit a single patch to that Jira.

            mattf Matthew Foley added a comment - As part of HDFS-1295 , this patch caused TestBlocksWithNotEnoughRacks to pass in build PreCommit-HDFS-Build/346, whereas it had failed before. Posted here for information, but I'm going to subordinate this bug to HDFS-1295 (which it was blocking) and submit a single patch to that Jira.
            eli Eli Collins added a comment -

            Hey Matt,

            I think the latest patch on HDFS-1562 addresses this issue, mind taking a look and confirming?

            Thanks,
            Eli

            eli Eli Collins added a comment - Hey Matt, I think the latest patch on HDFS-1562 addresses this issue, mind taking a look and confirming? Thanks, Eli
            mattf Matthew Foley added a comment -

            Replacing this with a very minimalist 1-line patch, to avoid interfering with the upcoming submission of HDFS-1562.

            mattf Matthew Foley added a comment - Replacing this with a very minimalist 1-line patch, to avoid interfering with the upcoming submission of HDFS-1562 .
            eli Eli Collins added a comment -

            +1

            I'll update the patch on HDFS-1562 to cover this case as well.

            I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently.

            I suspect this is because, post HDFS-15, the block remains in pending replications even though there are sufficient total # replicas, so as soon as the new datanodes come up a new replica is scheduled and an existing one is considered excess and is scheduled for deletion (it's considered excess because the replication factor has not yet been increased). Then the replication factor is increased causing 2 new replicas to be scheduled. If these new replicas complete before the excess replica is deleted then we've got REPLICATION_FACTOR + 1.

            eli Eli Collins added a comment - +1 I'll update the patch on HDFS-1562 to cover this case as well. I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently. I suspect this is because, post HDFS-15 , the block remains in pending replications even though there are sufficient total # replicas, so as soon as the new datanodes come up a new replica is scheduled and an existing one is considered excess and is scheduled for deletion (it's considered excess because the replication factor has not yet been increased). Then the replication factor is increased causing 2 new replicas to be scheduled. If these new replicas complete before the excess replica is deleted then we've got REPLICATION_FACTOR + 1.
            hadoopqa Hadoop QA added a comment -

            -1 overall. Here are the results of testing the latest attachment
            http://issues.apache.org/jira/secure/attachment/12476176/TestBlocksWithNotEnoughRacks_v2.patch
            against trunk revision 1091874.

            +1 @author. The patch does not contain any @author tags.

            +1 tests included. The patch appears to include 3 new or modified tests.

            +1 javadoc. The javadoc tool did not generate any warning messages.

            +1 javac. The applied patch does not increase the total number of javac compiler warnings.

            +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

            +1 release audit. The applied patch does not increase the total number of release audit warnings.

            -1 core tests. The patch failed these core unit tests:
            org.apache.hadoop.hdfs.server.namenode.TestNodeCount
            org.apache.hadoop.hdfs.TestFileConcurrentReader

            -1 contrib tests. The patch failed contrib unit tests.

            +1 system test framework. The patch passed system test framework compile.

            Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/
            Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
            Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//console

            This message is automatically generated.

            hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12476176/TestBlocksWithNotEnoughRacks_v2.patch against trunk revision 1091874. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.namenode.TestNodeCount org.apache.hadoop.hdfs.TestFileConcurrentReader -1 contrib tests. The patch failed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//console This message is automatically generated.
            eli Eli Collins added a comment -

            Test failures are unrelated.

            eli Eli Collins added a comment - Test failures are unrelated.
            eli Eli Collins added a comment -

            I've committed this. Thanks Matt!

            eli Eli Collins added a comment - I've committed this. Thanks Matt!
            hudson Hudson added a comment -

            Integrated in Hadoop-Hdfs-trunk-Commit #591 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/591/)
            HDFS-1828. TestBlocksWithNotEnoughRacks intermittently fails assert. Contributed by Matt Foley

            hudson Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #591 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/591/ ) HDFS-1828 . TestBlocksWithNotEnoughRacks intermittently fails assert. Contributed by Matt Foley
            hudson Hudson added a comment -

            Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

            hudson Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/ )
            tlipcon Todd Lipcon added a comment - Seems this is still failing intermittently: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/537//testReport/org.apache.hadoop.hdfs.server.namenode/TestBlocksWithNotEnoughRacks/testNodeDecomissionWithOverreplicationRespectsRackPolicy/
            szetszwo Tsz-wo Sze added a comment -

            Hi Todd, I think we should create a new bug instead of reopening this unless we are going to revert the committed patch. What do you think?

            szetszwo Tsz-wo Sze added a comment - Hi Todd, I think we should create a new bug instead of reopening this unless we are going to revert the committed patch. What do you think?
            szetszwo Tsz-wo Sze added a comment -

            Since we are not reverting the patch, re-close this. If the test is still failing, please create a new issue.

            szetszwo Tsz-wo Sze added a comment - Since we are not reverting the patch, re-close this. If the test is still failing, please create a new issue.

            People

              mattf Matthew Foley
              mattf Matthew Foley
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: