ZooKeeper / ZOOKEEPER-1109

Zookeeper service is down when SyncRequestProcessor meets any exception.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.3.0, 3.3.1, 3.3.2, 3.3.3
    • Fix Version/s: 3.4.0
    • Component/s: quorum
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Tags:
      quorum, leader, disk full, shutdown

      Description

      Problem
      ZooKeeper does not shut down completely when the dataDir disk is full, and the ZK cluster goes into an unserviceable state.

      Scenario
      If the leader's disk fills up, the ZooKeeper server tries to shut down, but it waits indefinitely while shutting down the SyncRequestProcessor thread.

      Root Cause
      this.join() is invoked on the same thread that triggered System.exit(11).

      When the disk fills up, the SyncRequestProcessor thread gets a 'No space left on device' exception and invokes System.exit(11) (the logs in the comments show this). Before the JVM exits, it runs the ShutdownHook of QuorumPeerMain, and the flow reaches SyncRequestProcessor.shutdown(). There, this.join() waits for the thread that invoked System.exit(11), while that thread is itself waiting for the shutdown hook to complete, so neither can proceed.
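
      The mutual wait described above can be imitated without actually calling System.exit. The following is only a sketch under stated assumptions: the class MutualJoinDemo and the method hookTimedOut are hypothetical names, a plain thread stands in for the JVM shutdown hook, and the hook uses a timed join so the demo terminates instead of hanging as the real server does.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class MutualJoinDemo {

    /**
     * Returns true if the stand-in hook gave up waiting because the worker
     * was still blocked. timeoutMillis must be greater than 0, because
     * join(0) would wait forever.
     */
    static boolean hookTimedOut(long timeoutMillis) {
        AtomicBoolean timedOut = new AtomicBoolean(false);
        AtomicReference<Thread> workerRef = new AtomicReference<>();

        Thread worker = new Thread(() -> {
            // Stand-in for the shutdown hook: it waits for the worker,
            // like SyncRequestProcessor.shutdown() calling this.join().
            Thread hook = new Thread(() -> {
                try {
                    Thread w = workerRef.get();
                    w.join(timeoutMillis);        // timed, so the demo ends
                    timedOut.set(w.isAlive());    // still alive => was stuck
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "hook");
            hook.start();
            try {
                // Stand-in for System.exit(11): the calling thread blocks
                // until the hook has finished running.
                hook.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "SyncThread");

        workerRef.set(worker);
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return timedOut.get();
    }

    public static void main(String[] args) {
        System.out.println("hook timed out waiting for worker: " + hookTimedOut(200));
    }
}
```

      With an untimed join in the hook, as in the reported stack traces, both threads would wait on each other forever.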

      1. ZOOKEEPER-1109.patch
        1 kB
        Laxman
      2. ZOOKEEPER-1109.1.patch
        1 kB
        Laxman

          Activity

          Laxman added a comment -

          Calling System.exit and Thread.join on the same thread causes this hang. This was introduced as part of ZOOKEEPER-121.

          Laxman added a comment -

          Reposting the comments and analysis

          I've also gone through Ted's earlier response on disk full scenario.
          http://www.google.co.in/url?sa=t&source=web&cd=3&ved=0CCAQFjAC&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fzookeeper-user%2F201106.mbox%2F%253CBANLkTimzQjXZvDKnP6xQLF9jHfsaz6JstA%40mail.gmail.com%253E&ei=FBQETvPWIcLNrQfk75yaDA&usg=AFQjCNFTkguyxTligpz1TZBmkqe9Osz-uw

          We feel that even when one cluster member's disk is full, the service provided by the entire cluster should not be interrupted.

          Thread dumps

          The following thread dump shows that the QuorumPeerMain shutdown hook thread (Thread-2) is waiting indefinitely inside SyncRequestProcessor.

          "Thread-2" prio=10 tid=0x0810a400 nid=0x1695 in Object.wait() [0xac783000] 
             java.lang.Thread.State: WAITING (on object monitor) 
                  at java.lang.Object.wait(Native Method) 
                  - waiting on <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor) 
                  at java.lang.Thread.join(Thread.java:1143) 
                  - locked <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor) 
                  at java.lang.Thread.join(Thread.java:1196) 
                  at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:171) 
                  at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.shutdown(ProposalRequestProcessor.java:79) 
                  at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:513) 
                  at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:413) 
                  at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:411) 
                  at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694) 
                  at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126) 
          
          "SyncThread:2" prio=10 tid=0xad7fd800 nid=0x4acb in Object.wait() [0xac9ba000] 
             java.lang.Thread.State: WAITING (on object monitor) 
                  at java.lang.Object.wait(Native Method) 
                  - waiting on <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1) 
                  at java.lang.Thread.join(Thread.java:1143) 
                  - locked <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1) 
                  at java.lang.Thread.join(Thread.java:1196) 
                  at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79) 
                  at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24) 
                  at java.lang.Shutdown.runHooks(Shutdown.java:79) 
                  at java.lang.Shutdown.sequence(Shutdown.java:123) 
                  at java.lang.Shutdown.exit(Shutdown.java:168) 
                  - locked <0xf01ff3b0> (a java.lang.Class for java.lang.Shutdown) 
                  at java.lang.Runtime.exit(Runtime.java:90) 
                  at java.lang.System.exit(System.java:904) 
                  at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:149)
          

          Logs

          2011-06-21 10:09:59,730 - FATAL [SyncThread:2:SyncRequestProcessor@148] - Severe unrecoverable error, exiting 
          java.io.IOException: No space left on device 
                  at java.io.FileOutputStream.writeBytes(Native Method) 
                  at java.io.FileOutputStream.write(FileOutputStream.java:260) 
                  at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) 
                  at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) 
                  at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:305) 
                  at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:324) 
                  at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484) 
                  at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:158) 
                  at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98) 
          2011-06-21 10:09:59,732 - INFO  [Thread-2:QuorumPeer@691] - The Quorum server is going for shutdown 
          2011-06-21 10:09:59,732 - INFO  [Thread-2:Leader@393] - Shutdown called 
          java.lang.Exception: shutdown Leader! reason: quorum Peer shutdown 
                  at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:393) 
                  at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694) 
                  at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126) 
          2011-06-21 10:09:59,733 - INFO  [Thread-6:Leader$LearnerCnxAcceptor@243] - exception while shutting down acceptor: java.net.SocketException: Socket closed 
          2011-06-21 10:09:59,758 - INFO  [ProcessThread:-1:PrepRequestProcessor@120] - PrepRequestProcessor exited loop! 
          2011-06-21 10:09:59,758 - INFO  [CommitProcessor:2:CommitProcessor@150] - CommitProcessor exited loop! 
          2011-06-21 10:09:59,759 - INFO  [Thread-2:FinalRequestProcessor@379] - shutdown of request processor complete 
          2011-06-21 10:10:00,000 - INFO  [SessionTracker:SessionTrackerImpl@165] - SessionTrackerImpl exited loop! 
          
          Laxman added a comment -

          Tested the patch with debug points.
          Not able to add a test case because this scenario involves System.exit.

          Patch
          If the shutdown has been triggered by this thread, we don't call this.join().
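
          The idea behind the patch can be sketched as follows. This is an illustration, not the committed change: the class GuardedShutdownDemo, the Worker thread, and the joinSkipped field are hypothetical, and a plain thread again stands in for the shutdown hook that System.exit would run. The running flag is declared volatile here so that the write made by the worker is visible to the hook thread.

```java
public class GuardedShutdownDemo {

    static class Worker extends Thread {
        // Cleared by the worker itself just before it "exits the JVM".
        private volatile boolean running = true;
        // Hypothetical bookkeeping so the demo can report what happened.
        volatile boolean joinSkipped = false;

        @Override
        public void run() {
            // Fatal-error path: mark this thread as no longer running,
            // then trigger shutdown (a thread stands in for System.exit).
            running = false;
            Thread hook = new Thread(this::shutdown, "hook");
            hook.start();
            try {
                hook.join(); // like System.exit waiting for its hooks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        void shutdown() {
            try {
                if (running) {
                    this.join();        // normal shutdown: wait for worker
                } else {
                    joinSkipped = true; // worker triggered the exit: no join
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Returns true if the worker finished and the join was skipped. */
    static boolean runOnce() {
        Worker w = new Worker();
        w.start();
        try {
            w.join(5000); // generous bound; the worker should end at once
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !w.isAlive() && w.joinSkipped;
    }

    public static void main(String[] args) {
        System.out.println("clean shutdown without join: " + runOnce());
    }
}
```

          Because the hook never joins the worker in this path, the worker's wait for the hook returns and both threads finish.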

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12484757/ZOOKEEPER-1109.patch
          against trunk revision 1140017.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/360//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/360//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/360//console

          This message is automatically generated.

          Laxman added a comment -

          The test failure reported here doesn't seem to have been introduced by this patch.
          All tests were verified locally and all are passing.

          Mahadev konar added a comment - - edited

          Laxman,
          I think we should probably declare the boolean running as volatile? Other than that, it looks good.

          Mahadev konar added a comment -

          Laxman,
          Any update? Are you planning to update the patch? If not, please let me know.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12486399/ZOOKEEPER-1109.1.patch
          against trunk revision 1146025.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/393//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/393//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/393//console

          This message is automatically generated.

          Laxman added a comment -

          Hi Mahadev, any suggestions on the reworked patch?

          Mahadev konar added a comment -

          +1, patch looks good. I'll go ahead and commit. Thanks Laxman! Sorry for my late response.

          Mahadev konar added a comment - - edited

          I am removing the fix version for 3.3.4. The patch doesn't apply to the 3.3 branch. I'll let the 3.3.4 release manager decide if they want to backport this.

          Mahadev konar added a comment -

          Just committed this to trunk. Thanks Laxman!

          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1255 (See https://builds.apache.org/job/ZooKeeper-trunk/1255/)
          ZOOKEEPER-1109. Zookeeper service is down when SyncRequestProcessor meets any exception. (Laxman via mahadev)

          mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1150903
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/SyncRequestProcessor.java

            People

            • Assignee:
              Laxman
              Reporter:
              Laxman
            • Votes:
              0
              Watchers:
              3

                Time Tracking

                Estimated: 72h
                Remaining: 72h
                Logged: Not Specified
