  1. Hadoop HDFS
  2. HDFS-9904

testCheckpointCancellationDuringUpload occasionally fails

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.3
    • Fix Version/s: 2.8.0, 2.7.3, 3.0.0-alpha1
    • Component/s: test
    • Labels:
      None
    • Target Version/s:
    • Hadoop Flags:
      Reviewed

      Description

      The failure was at the end of the test case, where the txid of the standby (former active) is checked. Since the checkpoint upload was canceled, the standby is not supposed to have the new checkpoint. Looking at the test log, that was still the case, but the standby then did a checkpoint on its own and bumped up the txid right before the check was performed.

      1. HDFS-9904.001.patch
        1 kB
        Yiqun Lin
      2. HDFS-9904.002.patch
        1 kB
        Yiqun Lin

          Activity

          kihwal Kihwal Lee added a comment -

          The stack trace from the test failure.

          java.lang.AssertionError: expected:<0> but was:<106>
          	at org.junit.Assert.fail(Assert.java:88)
          	at org.junit.Assert.failNotEquals(Assert.java:743)
          	at org.junit.Assert.assertEquals(Assert.java:118)
          	at org.junit.Assert.assertEquals(Assert.java:555)
          	at org.junit.Assert.assertEquals(Assert.java:542)
          	at org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints.testCheckpointCancellationDuringUpload(TestStandbyCheckpoints.java:328)
          

          We could set DFS_NAMENODE_CHECKPOINT_TXNS_KEY differently on the first NN so that it does not do a checkpoint when it becomes a standby.

          linyiqun Yiqun Lin added a comment -

          I attached a simple patch as you suggested. I ran the modified test case locally and it passes.

          -------------------------------------------------------
           T E S T S
          -------------------------------------------------------
          Running org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 21.487 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints
          
          Results :
          
          Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
          
          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESS
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time: 45.985 s
          [INFO] Finished at: 2016-03-07T10:09:19+08:00
          [INFO] Final Memory: 44M/951M
          [INFO] ------------------------------------------------------------------------
          
          kihwal Kihwal Lee added a comment -

          Thanks for working on the fix. The config is used to determine whether to create a new checkpoint. After loading/replaying edits, a standby checks how many transactions have gone by since the last checkpoint. If the number is greater than the configured limit, it will do a checkpoint. As you can see from the test output, there are around 106 transactions at the end. In order to prevent the standby from checkpointing, the config value should be bigger than that, e.g., 1000. Also, it should be set before the namenode is started and should be reset for other test cases.
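
          To make the threshold reasoning above concrete, here is a minimal sketch of the decision in plain Java. It is a simplified model and not the actual Hadoop code: the class name, the method name, and the inclusive comparison are assumptions made for illustration. Only the numbers (about 106 transactions, a limit of 5 versus 1000) come from this issue.

```java
// Simplified model of the standby's checkpoint decision; NOT the actual Hadoop code.
// The class and method names are hypothetical.
final class CheckpointTriggerSketch {

  /** Returns true when enough transactions have accumulated since the last checkpoint. */
  static boolean shouldCheckpoint(long lastAppliedTxId, long lastCheckpointTxId, long txnLimit) {
    long uncheckpointed = lastAppliedTxId - lastCheckpointTxId;
    // Assumption: the limit is treated as inclusive; the real comparison may differ.
    return uncheckpointed >= txnLimit;
  }

  public static void main(String[] args) {
    // About 106 transactions against a small limit: the standby would checkpoint on its own.
    System.out.println(shouldCheckpoint(106, 0, 5));
    // The same 106 transactions against a limit of 1000: no checkpoint is triggered.
    System.out.println(shouldCheckpoint(106, 0, 1000));
  }
}
```

          With a small limit the former active checkpoints shortly after becoming a standby, which would explain the "expected:<0> but was:<106>" assertion. With a limit well above the number of transactions the test generates, it does not.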

          linyiqun Yiqun Lin added a comment -

          Thanks Kihwal Lee for the concrete analysis. I had overlooked that.

          Also, it should be set before the namenode is started and should be reset for other test cases.

          The method testCheckpointCancellationDuringUpload already restarts all namenodes after changing their configuration, so setting the config at this point should be fine.

              // don't compress, we want a big image
              for (int i = 0; i < NUM_NNS; i++) {
                cluster.getConfiguration(i).setBoolean(
                    DFSConfigKeys.DFS_IMAGE_COMPRESS_KEY, false);
              }
          
              // Throttle SBN upload to make it hang during upload to ANN
              for (int i = 1; i < NUM_NNS; i++) {
                cluster.getConfiguration(i).setLong(
                    DFSConfigKeys.DFS_IMAGE_TRANSFER_RATE_KEY, 100);
              }
              for (int i = 0; i < NUM_NNS; i++) {
                cluster.restartNameNode(i);
              }
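
          The point about setting the value before the restart can be illustrated with a small stand-alone model. This is only a sketch under assumptions: NodeSketch is a hypothetical stand-in for a namenode that reads its limit once at (re)start, three nodes is an arbitrary choice, and the inclusive comparison is assumed. The key string is the property name behind DFS_NAMENODE_CHECKPOINT_TXNS_KEY. The values 5, 1000 and 106 are the ones mentioned in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a namenode that reads its checkpoint limit once, at (re)start.
// This models the ordering concern only; it does not use any Hadoop classes.
final class NodeSketch {
  static final String TXNS_KEY = "dfs.namenode.checkpoint.txns";

  final Map<String, Long> conf = new HashMap<>();
  private long activeLimit;

  NodeSketch(long initialLimit) {
    conf.put(TXNS_KEY, initialLimit);
    restart();
  }

  /** Re-reads the configuration, as a restart would. */
  void restart() {
    activeLimit = conf.get(TXNS_KEY);
  }

  /** Assumption: the limit is inclusive; the real comparison may differ. */
  boolean wouldCheckpoint(long uncheckpointedTxns) {
    return uncheckpointedTxns >= activeLimit;
  }

  public static void main(String[] args) {
    NodeSketch[] nodes = { new NodeSketch(5), new NodeSketch(5), new NodeSketch(5) };

    // Raise the limit on the first node only, before restarting it.
    nodes[0].conf.put(TXNS_KEY, 1000L);
    System.out.println(nodes[0].wouldCheckpoint(106)); // old limit still in effect

    for (NodeSketch n : nodes) {
      n.restart();
    }
    System.out.println(nodes[0].wouldCheckpoint(106)); // new limit now in effect
    System.out.println(nodes[1].wouldCheckpoint(106)); // other nodes keep the small limit
  }
}
```

          In this model a changed value has no effect until the node is restarted, which is why the setting has to be placed ahead of the restart loop. Only the first node gets the higher limit, so the other standbys can still checkpoint as the test expects.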
          

          It seems there is a similar problem in testNonPrimarySBNUploadFSImage. When the first namenode transitions to standby, it will also do a checkpoint, because 10 is bigger than 5 (the configured value). But the checkpoint should actually be uploaded by one of the standby nodes.

          doEdits(0, 10);
          cluster.transitionToStandby(0);
          

          Am I right? If so, we can solve both in this JIRA. I have also updated the patch to address your comments.

          linyiqun Yiqun Lin added a comment -

          Sorry about my last comment. The test case testNonPrimarySBNUploadFSImage has no problem; I overlooked that the last parameter, txid, had changed. Please ignore that part of the comment.

          kihwal Kihwal Lee added a comment -

          +1 I've verified that the config is only set for the specific test case.

          kihwal Kihwal Lee added a comment -

          I've committed this to trunk through branch-2.7. Thanks for working on this, Lin Yiqun.

          hudson Hudson added a comment -

          FAILURE: Integrated in Hadoop-trunk-Commit #9464 (See https://builds.apache.org/job/Hadoop-trunk-Commit/9464/)
          HDFS-9904. testCheckpointCancellationDuringUpload occasionally fails. (kihwal: rev d4574017845cfa7521e703f80efd404afd09b8c4)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestStandbyCheckpoints.java
          linyiqun Yiqun Lin added a comment -

          Thanks Kihwal Lee for the commit!

          vinodkv Vinod Kumar Vavilapalli added a comment -

          Closing the JIRA as part of 2.7.3 release.


            People

            • Assignee:
              linyiqun Yiqun Lin
              Reporter:
              kihwal Kihwal Lee
