Hadoop HDFS / HDFS-7763

fix zkfc hung issue due to not catching exception in a corner case

    Details

      Description

      In our production cluster, both zkfc processes hung after a ZooKeeper network outage.

      The zkfc log showed:

      2015-02-07,17:40:11,875 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3334ms for sessionid 0x4a61bacdd9dfb2, closing socket connection and attempting reconnect
      2015-02-07,17:40:11,977 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
      2015-02-07,17:40:12,425 INFO org.apache.zookeeper.ZooKeeper: Session: 0x4a61bacdd9dfb2 closed
      2015-02-07,17:40:12,425 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.
      2015-02-07,17:40:12,425 INFO org.apache.hadoop.ipc.Server: Stopping server on 11300
      2015-02-07,17:40:12,425 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x4a61bacdd9dfb2
      2015-02-07,17:40:12,426 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
      2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
      2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
      2015-02-07,17:40:12,426 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
      2015-02-07,17:40:12,426 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 11300
      

      The thread dump is also uploaded as an attachment.
      From the dump, we can see that the process did not exit because of some unknown non-daemon threads (pool-thread), while the critical threads, such as the health monitor and RPC threads, had already been stopped. As a result, our watchdog (supervisord) did not notice that the zkfc process was down or abnormal, so the subsequent namenode failover could not be done as expected.
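To make the symptom concrete, here is a minimal sketch (not taken from the attached jstack or patch) of how one might list the live non-daemon threads that keep a JVM from exiting. The class and helper names are hypothetical, and the thread name `pool-7-thread-1` is borrowed from the description purely for illustration.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.stream.Collectors;

public class NonDaemonThreadLister {

  /** Names of all live non-daemon threads other than the calling thread. */
  static List<String> liveNonDaemonThreadNames() {
    return Thread.getAllStackTraces().keySet().stream()
        .filter(t -> t.isAlive() && !t.isDaemon() && t != Thread.currentThread())
        .map(Thread::getName)
        .sorted()
        .collect(Collectors.toList());
  }

  /** Starts a non-daemon thread that blocks until the latch is released. */
  static Thread startBlockedWorker(String name, CountDownLatch release) {
    Thread worker = new Thread(() -> {
      try {
        release.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, name);
    worker.setDaemon(false); // explicit: a thread like this keeps the JVM alive
    worker.start();
    return worker;
  }

  public static void main(String[] args) throws InterruptedException {
    CountDownLatch release = new CountDownLatch(1);
    Thread worker = startBlockedWorker("pool-7-thread-1", release);
    // While the worker is blocked, the JVM would not exit even if main returned or threw.
    System.out.println("Non-daemon threads besides this one: " + liveNonDaemonThreadNames());
    release.countDown();
    worker.join();
  }
}
```

A thread dump gives the same information with stack traces; the sketch only shows which property (`isDaemon() == false`) decides whether a leftover thread blocks JVM shutdown.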

      There are two possible fixes here: 1) figure out where the unnamed threads, like pool-7-thread-1, come from, and either shut them down or set their daemon property. I tried to search but have found nothing so far. 2) catch the exception from ZKFailoverController.run() so that we still reach System.exit. The attached patch implements 2).
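For reference, option 1) could look roughly like the sketch below, assuming the owner of the pool were known. Names of the form `pool-N-thread-M` are what the JDK's default thread factory produces, and that factory creates non-daemon threads. The class name and the `zkfc-callback` prefix are hypothetical and do not appear in Hadoop.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedDaemonThreadFactory implements ThreadFactory {
  private final String prefix;
  private final AtomicInteger counter = new AtomicInteger(1);

  public NamedDaemonThreadFactory(String prefix) {
    this.prefix = prefix;
  }

  @Override
  public Thread newThread(Runnable r) {
    // A descriptive name makes the thread easy to trace in a jstack dump.
    Thread t = new Thread(r, prefix + "-" + counter.getAndIncrement());
    // Daemon threads do not keep the JVM alive once all non-daemon threads end.
    t.setDaemon(true);
    return t;
  }

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool =
        Executors.newFixedThreadPool(2, new NamedDaemonThreadFactory("zkfc-callback"));
    pool.submit(() -> System.out.println(
        Thread.currentThread().getName() + " daemon=" + Thread.currentThread().isDaemon()));
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.SECONDS);
  }
}
```

This only helps when the code creating the pool can be changed, which is why the attached patch takes option 2) instead.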

      1. HDFS-7763-001.txt
        0.9 kB
        Liang Xie
      2. HDFS-7763-002.txt
        0.9 kB
        Liang Xie
      3. jstack.4936
        12 kB
        Liang Xie

        Activity

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12697678/jstack.4936
        against trunk revision e0ec071.

        -1 patch. The patch command could not apply the patch.

        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9513//console

        This message is automatically generated.

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12697682/HDFS-7763-001.txt
        against trunk revision e0ec071.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9515//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-HDFS-Build/9515//artifact/patchprocess/newPatchFindbugsWarningshadoop-hdfs.html
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9515//console

        This message is automatically generated.

        xieliang007 Liang Xie added a comment -

        The findbugs warning is not related to the current fix: "Inconsistent synchronization of org.apache.hadoop.hdfs.server.namenode.BackupImage.namesystem; locked 60% of time"

        cmccabe Colin P. McCabe added a comment -

        I feel a bit confused here. Shouldn't exiting with an exception cause the exception to be logged?

        +    int retCode = 0;
        +    try {
        +      retCode = zkfc.run(parser.getRemainingArgs());
        +    } catch (Exception e) {
        +      LOG.warn("", e);
        +    }
        +    System.exit(retCode);
        

        Do you want to catch Throwable rather than Exception there? It seems like you would like to catch all exceptions.

              LOG.warn("", e);
        

        Why no message?
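The difference between the two catch clauses can be illustrated with a small standalone sketch (hypothetical class and method names, `System.err` standing in for the real logger): `catch (Exception e)` lets an `Error` escape, whereas `catch (Throwable t)` intercepts it, and both log a descriptive message rather than an empty string.

```java
public class CatchScope {

  /** Runs body; returns 0 on success, 1 if an Exception was caught. Errors escape. */
  static int runCatchingException(Runnable body) {
    try {
      body.run();
      return 0;
    } catch (Exception e) {
      System.err.println("Got an exception, exiting now: " + e);
      return 1;
    }
  }

  /** Runs body; returns 0 on success, 1 if any Throwable (including Error) was caught. */
  static int runCatchingThrowable(Runnable body) {
    try {
      body.run();
      return 0;
    } catch (Throwable t) {
      System.err.println("Got a fatal error, exiting now: " + t);
      return 1;
    }
  }

  public static void main(String[] args) {
    Runnable failing = () -> { throw new Error("simulated fatal error"); };
    System.out.println("catch Throwable returned " + runCatchingThrowable(failing));
    try {
      runCatchingException(failing);
    } catch (Error escaped) {
      // The Error was not intercepted, so code after the try block would be skipped.
      System.out.println("catch Exception let this escape: " + escaped);
    }
  }
}
```

If the `Error` escapes in a real `main`, the `System.exit` call after the try block is never reached, which is the same hang the issue describes.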

        xieliang007 Liang Xie added a comment -

        How about v2, Colin?

        hadoopqa Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12698304/HDFS-7763-002.txt
        against trunk revision 8a54384.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 javadoc. There were no new javadoc warning messages.

        +1 eclipse:eclipse. The patch built with eclipse:eclipse.

        +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.

        Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9554//testReport/
        Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9554//console

        This message is automatically generated.

        andrew.wang Andrew Wang added a comment -

        This looks good to me, though one little nit is we could do System.exit in a finally.

        +1, I'll commit shortly.
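A sketch of the "System.exit in a finally" shape, for illustration only and not the committed code: the exit call is passed in as a parameter (a hypothetical `IntConsumer exit`) so the control flow can be exercised without terminating the JVM. Using a non-zero code on failure is an assumption of this sketch.

```java
import java.util.function.IntConsumer;

public class ExitInFinally {

  /** Runs body, logs any Throwable, and always invokes exit with a return code. */
  static void runThenExit(Runnable body, IntConsumer exit) {
    int retCode = 1; // assumed convention: non-zero unless body completes normally
    try {
      body.run();
      retCode = 0;
    } catch (Throwable t) {
      System.err.println("Fatal error, exiting now: " + t);
    } finally {
      // Reached on every path, so leftover non-daemon threads cannot keep the process alive.
      exit.accept(retCode);
    }
  }

  public static void main(String[] args) {
    runThenExit(
        () -> { throw new IllegalStateException("simulated fatal error"); },
        System::exit);
  }
}
```

Keeping the catch block alongside the finally matters: with finally alone, the process would exit before the uncaught exception is reported.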

        andrew.wang Andrew Wang added a comment -

        Committed to trunk and branch-2, thanks for the nice find and fix Liang Xie!

        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-trunk-Commit #7193 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7193/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #115 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/115/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        SUCCESS: Integrated in Hadoop-Yarn-trunk #849 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/849/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk #2047 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2047/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #106 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/106/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #115 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/115/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        hudson Hudson added a comment -

        FAILURE: Integrated in Hadoop-Mapreduce-trunk #2065 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2065/)
        HDFS-7763. fix zkfc hung issue due to not catching exception in a corner case. Contributed by Liang Xie. (wang: rev 7105ebaa9f370db04962a1e19a67073dc080433b)

        • hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/tools/DFSZKFailoverController.java
        • hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
        vinodkv Vinod Kumar Vavilapalli added a comment -

        Sangjin Lee backported this to 2.6.1. I just pushed the commit to 2.6.1 after running compilation; the patch applied cleanly.


          People

          • Assignee:
            xieliang007 Liang Xie
            Reporter:
            xieliang007 Liang Xie
          • Votes:
            0
            Watchers:
            7
