Details

    • Type: Sub-task Sub-task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.23.0
    • Fix Version/s: 0.23.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      In server.namenode.TestBlocksWithNotEnoughRacks.testSufficientlyReplicatedBlocksWithNotEnoughRacks
      assert fails at curReplicas == REPLICATION_FACTOR, but it seems that it should go higher initially, and if the test doesn't wait for it to go back down, it will fail false positive.

        Issue Links

          Activity

          Hide
          Tsz Wo Nicholas Sze added a comment -

          Since we are not reverting the patch, re-close this. If the test is still failing, please create a new issue.

          Show
          Tsz Wo Nicholas Sze added a comment - Since we are not reverting the patch, re-close this. If the test is still failing, please create a new issue.
          Hide
          Tsz Wo Nicholas Sze added a comment -

          Hi Todd, I think we should create a new bug instead of reopening this unless we are going to revert the committed patch. What do you think?

          Show
          Tsz Wo Nicholas Sze added a comment - Hi Todd, I think we should create a new bug instead of reopening this unless we are going to revert the committed patch. What do you think?
          Show
          Todd Lipcon added a comment - Seems this is still failing intermittently: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/537//testReport/org.apache.hadoop.hdfs.server.namenode/TestBlocksWithNotEnoughRacks/testNodeDecomissionWithOverreplicationRespectsRackPolicy/
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/)

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk #643 (See https://builds.apache.org/hudson/job/Hadoop-Hdfs-trunk/643/ )
          Hide
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #591 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/591/)
          HDFS-1828. TestBlocksWithNotEnoughRacks intermittently fails assert. Contributed by Matt Foley

          Show
          Hudson added a comment - Integrated in Hadoop-Hdfs-trunk-Commit #591 (See https://hudson.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/591/ ) HDFS-1828 . TestBlocksWithNotEnoughRacks intermittently fails assert. Contributed by Matt Foley
          Hide
          Eli Collins added a comment -

          I've committed this. Thanks Matt!

          Show
          Eli Collins added a comment - I've committed this. Thanks Matt!
          Hide
          Eli Collins added a comment -

          Test failures are unrelated.

          Show
          Eli Collins added a comment - Test failures are unrelated.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12476176/TestBlocksWithNotEnoughRacks_v2.patch
          against trunk revision 1091874.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these core unit tests:
          org.apache.hadoop.hdfs.server.namenode.TestNodeCount
          org.apache.hadoop.hdfs.TestFileConcurrentReader

          -1 contrib tests. The patch failed contrib unit tests.

          +1 system test framework. The patch passed system test framework compile.

          Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/
          Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12476176/TestBlocksWithNotEnoughRacks_v2.patch against trunk revision 1091874. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed these core unit tests: org.apache.hadoop.hdfs.server.namenode.TestNodeCount org.apache.hadoop.hdfs.TestFileConcurrentReader -1 contrib tests. The patch failed contrib unit tests. +1 system test framework. The patch passed system test framework compile. Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//testReport/ Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/366//console This message is automatically generated.
          Hide
          Eli Collins added a comment -

          +1

          I'll update the patch on HDFS-1562 to cover this case as well.

          I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently.

          I suspect this is because, post HDFS-15, the block remains in pending replications even though there are sufficient total # replicas, so as soon as the new datanodes come up a new replica is scheduled and an existing one is considered excess and is scheduled for deletion (it's considered excess because the replication factor has not yet been increased). Then the replication factor is increased causing 2 new replicas to be scheduled. If these new replicas complete before the excess replica is deleted then we've got REPLICATION_FACTOR + 1.

          Show
          Eli Collins added a comment - +1 I'll update the patch on HDFS-1562 to cover this case as well. I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently. I suspect this is because, post HDFS-15 , the block remains in pending replications even though there are sufficient total # replicas, so as soon as the new datanodes come up a new replica is scheduled and an existing one is considered excess and is scheduled for deletion (it's considered excess because the replication factor has not yet been increased). Then the replication factor is increased causing 2 new replicas to be scheduled. If these new replicas complete before the excess replica is deleted then we've got REPLICATION_FACTOR + 1.
          Hide
          Matt Foley added a comment -

          Replacing this with a very minimalist 1-line patch, to avoid interfering with the upcoming submission of HDFS-1562.

          Show
          Matt Foley added a comment - Replacing this with a very minimalist 1-line patch, to avoid interfering with the upcoming submission of HDFS-1562 .
          Hide
          Eli Collins added a comment -

          Hey Matt,

          I think the latest patch on HDFS-1562 addresses this issue, mind taking a look and confirming?

          Thanks,
          Eli

          Show
          Eli Collins added a comment - Hey Matt, I think the latest patch on HDFS-1562 addresses this issue, mind taking a look and confirming? Thanks, Eli
          Hide
          Matt Foley added a comment -

          As part of HDFS-1295, this patch caused TestBlocksWithNotEnoughRacks to pass in build PreCommit-HDFS-Build/346, whereas it had failed before.

          Posted here for information, but I'm going to subordinate this bug to HDFS-1295 (which it was blocking) and submit a single patch to that Jira.

          Show
          Matt Foley added a comment - As part of HDFS-1295 , this patch caused TestBlocksWithNotEnoughRacks to pass in build PreCommit-HDFS-Build/346, whereas it had failed before. Posted here for information, but I'm going to subordinate this bug to HDFS-1295 (which it was blocking) and submit a single patch to that Jira.
          Hide
          Matt Foley added a comment -

          At line 79, it waits
          while ( (numRacks < 2) || (curReplicas < REPLICATION_FACTOR) || (neededReplicationSize > 0))

          But at line 95 it asserts
          assertTrue(curReplicas == REPLICATION_FACTOR)

          I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently. Changed the "while" to wait for the desired equality, i.e.,
          while (curReplicas != REPLICATION_FACTOR).

          Also changed the wait from infinite to a 20sec timeout with useful status output on failure.

          Similar code in the other test case under TestBlocksWithNotEnoughRacks does not have the same problem, but still changed the wait from infinite to bounded.

          Finally, added additional log messages useful for debugging if future problems.

          Show
          Matt Foley added a comment - At line 79, it waits while ( (numRacks < 2) || (curReplicas < REPLICATION_FACTOR) || (neededReplicationSize > 0)) But at line 95 it asserts assertTrue(curReplicas == REPLICATION_FACTOR) I believe that under the circumstances of the test, curReplicas will in fact be REPLICATION_FACTOR + 1, transiently. Changed the "while" to wait for the desired equality, i.e., while (curReplicas != REPLICATION_FACTOR). Also changed the wait from infinite to a 20sec timeout with useful status output on failure. Similar code in the other test case under TestBlocksWithNotEnoughRacks does not have the same problem, but still changed the wait from infinite to bounded. Finally, added additional log messages useful for debugging if future problems.

            People

            • Assignee:
              Matt Foley
              Reporter:
              Matt Foley
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development