HDFS-10733: NameNode terminated after full GC thinking QJM is unresponsive.

    Details

    • Hadoop Flags:
      Reviewed

      Description

      The NameNode went into a full GC while in AsyncLoggerSet.waitForWriteQuorum(). After the GC completes, it checks whether the quorum timeout has been reached. If the GC was long enough, the timeout has already expired, and QuorumCall.waitFor() throws a TimeoutException. Finally, FSEditLog.logSync() catches the exception and terminates the NameNode.
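The failure mode can be illustrated with a small model of a deadline check. This is only a sketch, not the Hadoop source: the class, method names, and the 20 s timeout are assumptions chosen for the example. It shows that a waiter which wakes up after a long pause, before any responses have been processed, sees an expired deadline even though the journal nodes may be healthy.

```java
/** Hypothetical, simplified model of a deadline check; not the Hadoop source. */
final class DeadlineCheckSketch {
  /** Time left until the deadline (start + timeout), as seen at time nowMs. */
  static long remainingMs(long startMs, long timeoutMs, long nowMs) {
    return (startMs + timeoutMs) - nowMs;
  }

  /** The waiter gives up when no quorum has been observed and no time is left. */
  static boolean timesOut(long startMs, long timeoutMs, long nowMs, boolean quorumReached) {
    return !quorumReached && remainingMs(startMs, timeoutMs, nowMs) <= 0;
  }

  public static void main(String[] args) {
    long timeoutMs = 20_000; // assumed quorum timeout for the example

    // The process is paused for 45 s; on wake-up no responses have been processed yet.
    System.out.println(timesOut(0, timeoutMs, 45_000, false)); // prints "true"

    // Without a pause, 5 s into the wait there is still time left.
    System.out.println(timesOut(0, timeoutMs, 5_000, false)); // prints "false"
  }
}
```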

      1. HDFS-10733.002.patch
        4 kB
        Vinitha Reddy Gankidi
      2. HDFS-10733.001.patch
        4 kB
        Vinitha Reddy Gankidi

          Activity

          kihwal Kihwal Lee added a comment -

          Can we do something similar to HDFS-9107?

          redvine Vinitha Reddy Gankidi added a comment -

          Kihwal Lee, thanks for the great suggestion.

          I have attached a patch that extends the end time (timeout) when the NN experiences a long pause due to a full GC. The included unit test asserts that, when the journal nodes genuinely do not respond, a timeout exception is still thrown rather than the timeout being extended as in the full GC case. Please take a look.
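The idea described in this comment can be sketched roughly as follows. This is not the attached patch: the class, its methods, and the pause threshold are hypothetical, and the clock values are passed in by hand so the behaviour is deterministic. A wait that overshoots its requested duration by more than the threshold is treated as a process-wide pause and the deadline is extended by one timeout; a wait that simply runs out with no responses still times out.

```java
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a deadline that is extended after a long pause; not the Hadoop source. */
final class PauseAwareDeadline {
  private final long timeoutMs;
  private final long pauseThresholdMs; // assumed knob for the example, not a Hadoop setting
  private long deadlineMs;

  PauseAwareDeadline(long startMs, long timeoutMs, long pauseThresholdMs) {
    this.timeoutMs = timeoutMs;
    this.pauseThresholdMs = pauseThresholdMs;
    this.deadlineMs = startMs + timeoutMs;
  }

  /** Records one wait; a large overshoot beyond the requested duration counts as a pause. */
  void afterWait(long requestedMs, long actualMs) {
    if (actualMs - requestedMs >= pauseThresholdMs) {
      deadlineMs += timeoutMs; // extend the deadline by one timeout
    }
  }

  /** Throws once the (possibly extended) deadline has passed. */
  void checkNotExpired(long nowMs) throws TimeoutException {
    if (deadlineMs - nowMs <= 0) {
      throw new TimeoutException("deadline " + deadlineMs + " ms passed at " + nowMs + " ms");
    }
  }

  long deadlineMs() {
    return deadlineMs;
  }

  public static void main(String[] args) throws Exception {
    // Case 1: a 1 s wait takes 25 s because of a pause, so the 20 s deadline moves to 40 s.
    PauseAwareDeadline paused = new PauseAwareDeadline(0, 20_000, 6_000);
    paused.afterWait(1_000, 25_000);
    paused.checkNotExpired(25_000); // does not throw
    System.out.println(paused.deadlineMs()); // prints "40000"

    // Case 2: no pause, the journal nodes are simply silent for the whole timeout.
    PauseAwareDeadline silent = new PauseAwareDeadline(0, 20_000, 6_000);
    silent.afterWait(20_000, 20_000);
    try {
      silent.checkNotExpired(20_000);
    } catch (TimeoutException e) {
      System.out.println("timed out"); // prints "timed out"
    }
  }
}
```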

          shv Konstantin Shvachko added a comment -

          This looks reasonable to me. It is similar to the HeartbeatManager approach.
          One nit: assertEquals() should print a meaningful message rather than just asserting. I see the other usages do not have a message, but let's at least not multiply the wrong pattern with new test cases.
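To illustrate the nit: JUnit 4's Assert class has overloads that take a leading message, for example assertEquals(String message, long expected, long actual). The sketch below uses a local stand-in with the same parameter order so that it runs without JUnit on the classpath. The message text and the counts are made up for illustration and are not taken from the patch.

```java
/** Illustration only: a local stand-in for a message-carrying assertEquals. */
final class AssertMessageSketch {
  /** Mirrors the parameter order of JUnit 4's assertEquals(String, long, long). */
  static void assertEquals(String message, long expected, long actual) {
    if (expected != actual) {
      throw new AssertionError(message + " expected:<" + expected + "> but was:<" + actual + ">");
    }
  }

  /** Returns the failure text produced for a mismatch, or null if the values match. */
  static String failureText(String message, long expected, long actual) {
    try {
      assertEquals(message, expected, actual);
      return null;
    } catch (AssertionError e) {
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    // Hypothetical check: two successful responses were expected, but only one was counted.
    System.out.println(failureText("Unexpected number of successful responses", 2, 1));
  }
}
```

With a message, a failing run states what was being checked instead of reporting only the two mismatched numbers.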

          redvine Vinitha Reddy Gankidi added a comment -

          Konstantin Shvachko, I agree. Attached a new patch with this change.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote  Subsystem   Runtime  Comment
          0     reexec      0m 19s   Docker mode activated.
          +1    @author     0m 0s    The patch does not contain any @author tags.
          +1    test4tests  0m 0s    The patch appears to include 1 new or modified test files.
          +1    mvninstall  13m 50s  trunk passed
          +1    compile     0m 51s   trunk passed
          +1    checkstyle  0m 28s   trunk passed
          +1    mvnsite     0m 57s   trunk passed
          +1    mvneclipse  0m 14s   trunk passed
          +1    findbugs    1m 54s   trunk passed
          +1    javadoc     0m 43s   trunk passed
          +1    mvninstall  0m 52s   the patch passed
          +1    compile     0m 48s   the patch passed
          +1    javac       0m 48s   the patch passed
          +1    checkstyle  0m 26s   the patch passed
          +1    mvnsite     0m 53s   the patch passed
          +1    mvneclipse  0m 11s   the patch passed
          +1    whitespace  0m 0s    The patch has no whitespace issues.
          +1    findbugs    2m 1s    the patch passed
          +1    javadoc     0m 40s   the patch passed
          -1    unit        83m 9s   hadoop-hdfs in the patch failed.
          +1    asflicense  0m 34s   The patch does not generate ASF License warnings.
                            110m 9s



          Reason                 Tests
          Timed out junit tests  org.apache.hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean
                                 org.apache.hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure



          Subsystem       Report/Notes
          Docker          Image:yetus/hadoop:a9ad5d6
          JIRA Issue      HDFS-10733
          JIRA Patch URL  https://issues.apache.org/jira/secure/attachment/12846719/HDFS-10733.002.patch
          Optional Tests  asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname           Linux 9e39f19c6ee1 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
          Build tool      maven
          Personality     /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision    trunk / 4db119b
          Default Java    1.8.0_111
          findbugs        v3.0.0
          unit            https://builds.apache.org/job/PreCommit-HDFS-Build/18142/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results    https://builds.apache.org/job/PreCommit-HDFS-Build/18142/testReport/
          modules         C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output  https://builds.apache.org/job/PreCommit-HDFS-Build/18142/console
          Powered by      Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          shv Konstantin Shvachko added a comment -

          +1 on the latest patch.
          Failed tests passed locally for me. Will commit in a bit.

          shv Konstantin Shvachko added a comment -

          I just committed this. Thank you, Vinitha.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #11135 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11135/)
          HDFS-10733. NameNode terminated after full GC thinking QJM is (shv: rev 8a0fa0f7e88c45a98c6f266d6349cb426dd06495)

          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/client/QuorumCall.java
          • (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/qjournal/client/TestQuorumCall.java
          yangjiandan Jiandan Yang added a comment -

          Vinitha Reddy Gankidi, I have a question about the following code. When the full GC time far exceeds the value of 'millis', (et + millis) is still earlier than the current time, so the remaining time is negative and a TimeoutException is thrown. How about et = now + millis?

          if (shouldIncreaseQuorumTimeout(-rem, millis)) {
            et = et + millis;
          }
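The difference between the two ways of extending the deadline can be checked with a little arithmetic. This is a sketch with made-up numbers (a 20 s timeout and a 65 s pause); the method names are hypothetical, and only the expressions et + millis and now + millis come from the comment above.

```java
/** Compares two ways of extending a deadline after a long pause; illustration only. */
final class DeadlineExtension {
  /** Extension as in the quoted snippet: relative to the old deadline. */
  static long extendFromDeadline(long et, long millis) {
    return et + millis;
  }

  /** Suggested alternative: relative to the current time. */
  static long extendFromNow(long now, long millis) {
    return now + millis;
  }

  public static void main(String[] args) {
    long millis = 20_000;  // assumed timeout
    long et = 0 + millis;  // original deadline for a wait that started at t = 0
    long now = 65_000;     // the thread wakes up after a 65 s pause

    long fromDeadline = extendFromDeadline(et, millis); // 40000, still in the past
    long fromNow = extendFromNow(now, millis);          // 85000, a full timeout ahead

    System.out.println("et + millis:  remaining " + (fromDeadline - now) + " ms"); // -25000
    System.out.println("now + millis: remaining " + (fromNow - now) + " ms");      // 20000
  }
}
```

Whenever the pause is longer than twice the timeout, extending from the old deadline leaves no time remaining, so the next check still times out.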
          
          xkrogen Erik Krogen added a comment -

          Jiandan Yang, you are absolutely right. I have filed HDFS-12323 for this bug.


            People

            • Assignee: Vinitha Reddy Gankidi (redvine)
            • Reporter: Konstantin Shvachko (shv)
            • Votes: 0
            • Watchers: 14
