
Standby NN can not trigger log roll after EditLogTailer thread failed 3 times in EditLogTailer.triggerActiveLogRoll method.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha1
    • Fix Version/s: 3.0.0-alpha1
    • Component/s: auto-failover
    • Labels:

      Description

      When all NameNodes are in standby state, EditLogTailer retries 3 times to trigger a log roll, then fails and throws the exception "Cannot find any valid remote NN to service request!". After one NameNode becomes active, the standby NN still cannot trigger a log roll, because the variable "nnLoopCount" is still 3 and is never reset to 0.

      1. HDFS-10536.02.patch
        4 kB
        XingFeng Shen
      2. HDFS-10536.patch
        1 kB
        XingFeng Shen
      3. HDFS-10536-02.patch
        4 kB
        XingFeng Shen

        Issue Links

          Activity

          xingfengshen XingFeng Shen added a comment -

          The standby NN throws this exception:

          2016-06-16 20:27:51,456 | INFO  | Edit log tailer | Triggering log roll on remote NameNode | EditLogTailer.java:296
          2016-06-16 20:27:51,531 | WARN  | Edit log tailer | Failed to reach remote node: RemoteNameNodeInfo [nnId=19, ipcAddress=szv1000044725/10.120.176.172:25000, httpAddress=https://szv1000044725:25003], retrying with remaining remote NNs | EditLogTailer.java:431
          2016-06-16 20:27:51,535 | WARN  | Edit log tailer | Failed to reach remote node: RemoteNameNodeInfo [nnId=19, ipcAddress=szv1000044725/10.120.176.172:25000, httpAddress=https://szv1000044725:25003], retrying with remaining remote NNs | EditLogTailer.java:431
          2016-06-16 20:27:51,538 | WARN  | Edit log tailer | Failed to reach remote node: RemoteNameNodeInfo [nnId=19, ipcAddress=szv1000044725/10.120.176.172:25000, httpAddress=https://szv1000044725:25003], retrying with remaining remote NNs | EditLogTailer.java:431
          2016-06-16 20:27:51,538 | WARN  | Edit log tailer | Unable to trigger a roll of the active NN | EditLogTailer.java:316
          java.io.IOException: Cannot find any valid remote NN to service request!
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$MultipleNameNodeProxy.call(EditLogTailer.java:439)
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.triggerActiveLogRoll(EditLogTailer.java:298)
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.access$800(EditLogTailer.java:70)
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:355)
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$400(EditLogTailer.java:324)
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:341)
                  at java.security.AccessController.doPrivileged(Native Method)
                  at javax.security.auth.Subject.doAs(Subject.java:360)
                  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1691)
                  at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:443)
                  at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:337)
          

          After one NameNode becomes active, the standby NN still cannot trigger a log roll, because the variable "nnLoopCount" is still 3 and is never reset to 0.

          private NamenodeProtocol getActiveNodeProxy() throws IOException {
            if (cachedActiveProxy == null) {
              while (true) {
                // if we have reached the max loop count, quit by returning null
                if ((nnLoopCount / nnCount) >= maxRetries) {
                  return null;
                }
                ......
              }
            }
            assert cachedActiveProxy != null;
            return cachedActiveProxy;
          }
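
To make the guard above concrete, here is a minimal, self-contained sketch of the counter arithmetic. It is a simplified model, not the Hadoop source: the class `RetryGuardDemo`, the method `findActive`, and the `resetFirst` flag are hypothetical names introduced for illustration only. It assumes a single remote NameNode and 3 retries, which matches the three "Failed to reach remote node" warnings in the log. Resetting the counter at the start of each attempt is shown as one possible fix; whether the attached patch does exactly this is an assumption.

```java
import java.util.function.IntPredicate;

/** Simplified, hypothetical model of the retry guard; not the Hadoop source. */
class RetryGuardDemo {
    private final int nnCount;     // number of remote NameNodes to try
    private final int maxRetries;  // full passes over the NameNode list allowed
    private int nnLoopCount = 0;   // persists across calls, as in the report

    RetryGuardDemo(int nnCount, int maxRetries) {
        this.nnCount = nnCount;
        this.maxRetries = maxRetries;
    }

    /**
     * Returns the index of the first NameNode accepted by isActive,
     * or -1 once the guard (nnLoopCount / nnCount) >= maxRetries trips.
     * resetFirst models an assumed fix: start every attempt from zero.
     */
    int findActive(IntPredicate isActive, boolean resetFirst) {
        if (resetFirst) {
            nnLoopCount = 0;
        }
        while (true) {
            if ((nnLoopCount / nnCount) >= maxRetries) {
                return -1; // same condition as in the snippet above
            }
            int candidate = nnLoopCount % nnCount;
            nnLoopCount++;
            if (isActive.test(candidate)) {
                return candidate;
            }
        }
    }

    public static void main(String[] args) {
        RetryGuardDemo demo = new RetryGuardDemo(1, 3);
        // All NameNodes standby: three attempts, then the guard trips.
        System.out.println(demo.findActive(i -> false, false)); // prints -1
        // A NameNode is now active, but the stale counter still trips the guard.
        System.out.println(demo.findActive(i -> true, false));  // prints -1
        // With the counter reset first, the active NameNode is found.
        System.out.println(demo.findActive(i -> true, true));   // prints 0
    }
}
```

The second call is the failure described in this issue: the guard is evaluated before any NameNode is contacted, so a counter left at its maximum blocks every later attempt.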
          
          xingfengshen XingFeng Shen added a comment -

          Steps to reproduce:
          1) Stop both ZKFCs.
          2) Restart the active NN; it will come back as standby.
          3) Check the standby NN logs, which will show the exception.
          4) Start both ZKFCs; one NN will become active. The standby NN will throw the exception again.

          xingfengshen XingFeng Shen added a comment - - edited

          Hi Brahma Reddy Battula, Vinayakumar B, please help check this issue and give me some suggestions.

          vinayrpet Vinayakumar B added a comment -

          XingFeng Shen, thanks for the fix.
          1. Fix looks good.
          2. Please add a MiniDFSCluster test.
          The scenario is simple: have a cluster with 3 NameNodes, all in standby, and verify that periodic edit log tailing happens.

          3. Hit "Submit Patch" once the patch is uploaded.
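
The scenario in point 2 can be sketched without a real cluster. The following is a plain-Java stand-in, not a MiniDFSCluster test: the class `AllStandbyScenario` and its methods are hypothetical. It models one standby's tailer in a 3-NameNode cluster (so 2 remote NameNodes), runs several periods while every NameNode is standby, then makes one active and checks that the next period succeeds. It assumes the retry counter starts from zero in each period.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the scenario; a real test would use MiniDFSCluster. */
class AllStandbyScenario {
    enum State { STANDBY, ACTIVE }

    private final State[] remoteNameNodes; // the other NameNodes, seen from one standby
    private final int maxRetries;          // full passes over the list per period
    private int rollsTriggered = 0;

    AllStandbyScenario(int remoteCount, int maxRetries) {
        this.remoteNameNodes = new State[remoteCount];
        Arrays.fill(this.remoteNameNodes, State.STANDBY);
        this.maxRetries = maxRetries;
    }

    void transitionToActive(int index) {
        remoteNameNodes[index] = State.ACTIVE;
    }

    /** One tailer period: returns true if a log roll could be triggered. */
    boolean tailerPeriod() {
        int loopCount = 0; // local, so each period starts from zero (assumed behavior)
        while ((loopCount / remoteNameNodes.length) < maxRetries) {
            State candidate = remoteNameNodes[loopCount % remoteNameNodes.length];
            loopCount++;
            if (candidate == State.ACTIVE) {
                rollsTriggered++;
                return true;
            }
        }
        return false; // no active NameNode reachable in this period
    }

    int rollsTriggered() {
        return rollsTriggered;
    }

    public static void main(String[] args) {
        AllStandbyScenario tailer = new AllStandbyScenario(2, 3);
        for (int period = 0; period < 5; period++) {
            System.out.println("period " + period + " rolled: " + tailer.tailerPeriod());
        }
        tailer.transitionToActive(1);
        System.out.println("after failover rolled: " + tailer.tailerPeriod());
        System.out.println("rolls triggered: " + tailer.rollsTriggered());
    }
}
```

A real test would replace the array with a running cluster and check the NameNodes' edit logs instead of a counter, but the shape of the check is the same: repeated failures while all NameNodes are standby must not prevent success once one becomes active.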

          xingfengshen XingFeng Shen added a comment -

          Vinayakumar B, I have updated the patch; please help check it.

          xingfengshen XingFeng Shen added a comment -

          It looks like the build failed here, possibly due to an environment issue.
          Can you help trigger it again?

          xingfengshen XingFeng Shen added a comment -

          Uploading the same patch again to re-trigger the Hadoop CI build.

          hadoopqa Hadoop QA added a comment -
          -1 overall



          Vote Subsystem Runtime Comment
          0 reexec 0m 22s Docker mode activated.
          +1 @author 0m 0s The patch does not contain any @author tags.
          +1 test4tests 0m 0s The patch appears to include 1 new or modified test files.
          +1 mvninstall 6m 19s trunk passed
          +1 compile 0m 45s trunk passed
          +1 checkstyle 0m 25s trunk passed
          +1 mvnsite 0m 51s trunk passed
          +1 mvneclipse 0m 12s trunk passed
          +1 findbugs 1m 43s trunk passed
          +1 javadoc 0m 56s trunk passed
          +1 mvninstall 0m 48s the patch passed
          +1 compile 0m 43s the patch passed
          +1 javac 0m 43s the patch passed
          +1 checkstyle 0m 23s the patch passed
          +1 mvnsite 0m 50s the patch passed
          +1 mvneclipse 0m 10s the patch passed
          +1 whitespace 0m 0s The patch has no whitespace issues.
          +1 findbugs 1m 47s the patch passed
          +1 javadoc 0m 53s the patch passed
          -1 unit 72m 52s hadoop-hdfs in the patch failed.
          +1 asflicense 0m 19s The patch does not generate ASF License warnings.
          91m 34s



          Reason Tests
          Failed junit tests hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer
            hadoop.hdfs.server.namenode.TestDecommissioningStatus
            hadoop.hdfs.server.namenode.snapshot.TestOpenFilesWithSnapshot



          Subsystem Report/Notes
          Docker Image:yetus/hadoop:85209cc
          JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12813021/HDFS-10536.02.patch
          JIRA Issue HDFS-10536
          Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
          uname Linux f93dcbea928b 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
          Build tool maven
          Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
          git revision trunk / 6314843
          Default Java 1.8.0_91
          findbugs v3.0.0
          unit https://builds.apache.org/job/PreCommit-HDFS-Build/15902/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
          Test Results https://builds.apache.org/job/PreCommit-HDFS-Build/15902/testReport/
          modules C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs
          Console output https://builds.apache.org/job/PreCommit-HDFS-Build/15902/console
          Powered by Apache Yetus 0.4.0-SNAPSHOT http://yetus.apache.org

          This message was automatically generated.

          vinayrpet Vinayakumar B added a comment -

          v2 patch looks good. +1.
          Will commit today.

          vinayrpet Vinayakumar B added a comment -

          Committed to trunk.
          Thanks for the contribution, XingFeng Shen.

          hudson Hudson added a comment -

          SUCCESS: Integrated in Hadoop-trunk-Commit #10020 (See https://builds.apache.org/job/Hadoop-trunk-Commit/10020/)
          HDFS-10536. Standby NN can not trigger log roll after EditLogTailer (vinayakumarb: rev 73615a789d96292e2731b5aa33ce46aa004d8211)

          • hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/namenode/ha/TestEditLogTailer.java
          • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/ha/EditLogTailer.java

            People

            • Assignee:
              xingfengshen XingFeng Shen
              Reporter:
              xingfengshen XingFeng Shen
            • Votes:
              0
              Watchers:
              8

              Dates

              • Created:
                Updated:
                Resolved:
