Hadoop HDFS / HDFS-5672

TestHASafeMode#testSafeBlockTracking fails in trunk

    Details

    • Type: Test
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: namenode
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      From build #1614:

       TestHASafeMode.testSafeBlockTracking:623->assertSafeMode:488 Bad safemode status: 'Safe mode is ON. The reported blocks 3 needs additional 7 blocks to reach the threshold 0.9990 of total blocks 10.
      Safe mode will be turned off automatically'
      

          Activity

          Ted Yu added a comment -

          The test failed again in build #1712:

          TestHASafeMode.testSafeBlockTracking:633->assertSafeMode:493 Bad safemode status: 'Safe mode is ON. The reported blocks 12 needs additional 3 blocks to reach the threshold 0.9990 of total blocks 15.
          The number of live datanodes 3 has reached the minimum number 0. In safe mode extension. Safe mode will be turned off automatically once the thresholds have been reached.'

          Jing Zhao added a comment -

          Uploaded a patch to fix this.

          We can consistently reproduce the issue with this change in TestHASafeMode#testSafeBlockTracking:

               } finally {
          +      cluster.shutdownNameNode(1);
                 for (FSDataOutputStream stm : stms) {
                   IOUtils.closeStream(stm);
                 }
               }
          

          And the fix is just one line in BlockManager#processReportedBlock:

               if (isBlockUnderConstruction(storedBlock, ucState, reportedState)) {
          -      toUC.add(new StatefulBlockInfo(
          -          (BlockInfoUnderConstruction)storedBlock, block, reportedState));
          +      toUC.add(new StatefulBlockInfo((BlockInfoUnderConstruction) storedBlock,
          +          new Block(block), reportedState));
                 return storedBlock;
               }
          

          The issue is that when BlockManager#reportDiff iterates over a report and calls processReportedBlock for each reported block, the block parameter is always the same Block object, which BlockReportIterator reuses. As a result, the entries added to the toUC list all reference that one object, so the list ends up holding incorrect information. This wrong information is later recorded as a ReplicaUnderConstruction in the corresponding BlockInfo object. When the corresponding file is closed, the NN checks the replicas of the block and marks them as stale if it finds an inconsistent generation stamp, which ultimately affects the safe block count calculation.

          In the unit test, when the standby NN (SBN) restarts and all the DNs have pending IBRs for it, the SBN processes the IBRs before the first full block report. The SBN then calls processReport, instead of processFirstBlockReport, to process the full block reports from all the DNs. The bug above is therefore hit 3 times, and the safe block count cannot increase for the corresponding blocks.
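
As an illustration only (this is not code from the patch), the aliasing problem described above can be reproduced in a few lines of plain Java. The sketch assumes a hypothetical MutableBlock class standing in for the HDFS Block type, and a loop that reuses one instance the way the comment says BlockReportIterator does. The names ReusedBlockDemo, MutableBlock, collectAliased and collectCopied are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class ReusedBlockDemo {
    /** Hypothetical stand-in for a mutable block descriptor; not the HDFS Block class. */
    static final class MutableBlock {
        long id;
        long genStamp;

        MutableBlock() {}

        /** Copy constructor, playing the role of new Block(block) in the fix. */
        MutableBlock(MutableBlock other) {
            this.id = other.id;
            this.genStamp = other.genStamp;
        }

        void set(long id, long genStamp) {
            this.id = id;
            this.genStamp = genStamp;
        }
    }

    /** Made-up report contents: {block id, generation stamp}. */
    static final long[][] REPORTED = { {1, 1001}, {2, 1002}, {3, 1003} };

    /** Buggy variant: every list entry is a reference to the single reused object. */
    static List<MutableBlock> collectAliased() {
        List<MutableBlock> out = new ArrayList<>();
        MutableBlock reused = new MutableBlock();
        for (long[] r : REPORTED) {
            reused.set(r[0], r[1]); // overwrites what earlier entries point to
            out.add(reused);
        }
        return out;
    }

    /** Fixed variant: each list entry is a defensive copy taken at insertion time. */
    static List<MutableBlock> collectCopied() {
        List<MutableBlock> out = new ArrayList<>();
        MutableBlock reused = new MutableBlock();
        for (long[] r : REPORTED) {
            reused.set(r[0], r[1]);
            out.add(new MutableBlock(reused));
        }
        return out;
    }

    public static void main(String[] args) {
        for (MutableBlock b : collectAliased()) {
            System.out.println("aliased: id=" + b.id + " genStamp=" + b.genStamp);
        }
        for (MutableBlock b : collectCopied()) {
            System.out.println("copied:  id=" + b.id + " genStamp=" + b.genStamp);
        }
    }
}
```

In the aliased variant all three entries report the last block's id and generation stamp, which is the kind of stale or mismatched generation stamp the comment describes. Copying at insertion time preserves each block's own values.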

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12636868/HDFS-5672.000.patch
          against trunk revision .

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 1 new or modified test files.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 javadoc. There were no new javadoc warning messages.

          +1 eclipse:eclipse. The patch built with eclipse:eclipse.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

          org.apache.hadoop.hdfs.TestSafeMode

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/6511//testReport/
          Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/6511//console

          This message is automatically generated.

          Tsz Wo Nicholas Sze added a comment -

          +1 on the patch. Good catch!

          Ted Yu added a comment -

          Was the failure, https://builds.apache.org/job/PreCommit-HDFS-Build/6511//testReport/org.apache.hadoop.hdfs/TestSafeMode/testInitializeReplQueuesEarly/, related to the patch?

          Jing Zhao added a comment - - edited

          I think it should be unrelated. The test passed in my local run. The same failure also appeared in the Jenkins runs of HDFS-6114 and HDFS-6068. We should create a JIRA to fix it.

          Ted Yu added a comment -

          Jing, I created HDFS-6160 to track TestSafeMode failure.

          Jing Zhao added a comment -

          Thanks Ted!

          Jing Zhao added a comment -

          I've committed the patch to trunk, branch-2 and branch-2.4. Thanks a lot for the report, Ted! And thanks for the review, Nicholas!

          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #5409 (See https://builds.apache.org/job/Hadoop-trunk-Commit/5409/)
          HDFS-5672. TestHASafeMode#testSafeBlockTracking fails in trunk. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1581994)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHASafeMode.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Yarn-trunk #522 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/522/)
          HDFS-5672. TestHASafeMode#testSafeBlockTracking fails in trunk. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1581994)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHASafeMode.java
          Hudson added a comment -

          FAILURE: Integrated in Hadoop-Mapreduce-trunk #1739 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1739/)
          HDFS-5672. TestHASafeMode#testSafeBlockTracking fails in trunk. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1581994)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHASafeMode.java
          Hudson added a comment -

          SUCCESS: Integrated in Hadoop-Hdfs-trunk #1714 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/1714/)
          HDFS-5672. TestHASafeMode#testSafeBlockTracking fails in trunk. Contributed by Jing Zhao. (jing9: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1581994)

          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java
          • /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestHASafeMode.java

            People

            • Assignee:
              Jing Zhao
              Reporter:
              Ted Yu
            • Votes:
              0
              Watchers:
              7
