ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1140

server shutdown is not stopping threads

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.4.0
    • Fix Version/s: 3.4.0
    • Component/s: server, tests
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Near the end of QuorumZxidSyncTest there are tons of threads running - 115 "ProcessThread" threads, similar numbers of SessionTracker.

      Also I see ~100 ReadOnlyRequestProcessor - why is this running as a separate thread? (henry/flavio?)

        Activity

        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #1288 (See https://builds.apache.org/job/ZooKeeper-trunk/1288/)
        ZOOKEEPER-1140. server shutdown is not stopping threads. (laxman via mahadev)

        mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163102
        Files :

        • /zookeeper/trunk/CHANGES.txt
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java
        • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyZooKeeperServer.java
        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #1288 (See https://builds.apache.org/job/ZooKeeper-trunk/1288/ ) ZOOKEEPER-1140 . server shutdown is not stopping threads. (laxman via mahadev) mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1163102 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerHandler.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyZooKeeperServer.java
        Hide
        Laxman added a comment -

        Thanks for review and commit Mahadev.

        Show
        Laxman added a comment - Thanks for review and commit Mahadev.
        Hide
        Mahadev konar added a comment -

        I just committed this. Thanks Laxman!

        Show
        Mahadev konar added a comment - I just committed this. Thanks Laxman!
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12490942/ZOOKEEPER-1140.patch
        against trunk revision 1163015.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12490942/ZOOKEEPER-1140.patch against trunk revision 1163015. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/478//console This message is automatically generated.
        Hide
        Mahadev konar added a comment -

        +1 looks good to me.

        Show
        Mahadev konar added a comment - +1 looks good to me.
        Hide
        Laxman added a comment -

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        This fix doesn't include any new functionality. Fixes the thread leaks for which I'm not able to automate a testcase.

        Manually verified the QuorumZxidSyncTest and checked the thread dumps to see any leaks in debug mode. All the thread leaks noticed were fixed with this patch.

        Show
        Laxman added a comment - -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. This fix doesn't include any new functionality. Fixes the thread leaks for which I'm not able to automate a testcase. Manually verified the QuorumZxidSyncTest and checked the thread dumps to see any leaks in debug mode. All the thread leaks noticed were fixed with this patch.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12490942/ZOOKEEPER-1140.patch
        against trunk revision 1159432.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/468//testReport/
        Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/468//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/468//console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12490942/ZOOKEEPER-1140.patch against trunk revision 1159432. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/468//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/468//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/468//console This message is automatically generated.
        Hide
        Laxman added a comment -

        Patch fixes both the thread leaks noticed as per the a/m approach.

        Show
        Laxman added a comment - Patch fixes both the thread leaks noticed as per the a/m approach.
        Hide
        Laxman added a comment -

        Shouldn't R/O support only be enabled via an explicit configuration option? i.e. off by default.

        I feel we may need to address the above comment in separate jira.

        Show
        Laxman added a comment - Shouldn't R/O support only be enabled via an explicit configuration option? i.e. off by default. I feel we may need to address the above comment in separate jira.
        Hide
        Laxman added a comment -

        Two leaks has been identified.

        1) SendThread is getting leaked. [LearnerHandler]
        Approach to fix
        a) Add the proposalOfDeath in shutdown() rather in finally block.

        2) ROZK is getting shutdown even before startup.
        Approach to fix
        a) Synchronize the startup and shutdown of ROZK to avoid concurrent startp and shutdown.
        b) Add a flag "shutdown" and check in startup to avoid startup after shutdown.
        c) Interrupt the thread that is starting ROZK and join on that thread in finally block before calling shutdown.

        Both of these leaks I'm planning to address in my patch as per a/m approach.

        Please let me know if you have any different approach or any suggestion.

        Show
        Laxman added a comment - Two leaks has been identified. 1) SendThread is getting leaked. [LearnerHandler] Approach to fix a) Add the proposalOfDeath in shutdown() rather in finally block. 2) ROZK is getting shutdown even before startup. Approach to fix a) Synchronize the startup and shutdown of ROZK to avoid concurrent startp and shutdown. b) Add a flag "shutdown" and check in startup to avoid startup after shutdown. c) Interrupt the thread that is starting ROZK and join on that thread in finally block before calling shutdown. Both of these leaks I'm planning to address in my patch as per a/m approach. Please let me know if you have any different approach or any suggestion.
        Hide
        Laxman added a comment -

        Thread [Sender-/127.0.0.1:3084] (Running)

        When I observed QuorumZxidSyncTest, the above thread is also leaking in a different scenario apart from ReadOnlyZKServer. Above mentioned thread will be shutdown only if it receives proposalOfDeath.

        We are queuing proposalOfDeath in finally block of LearnerHandler.run().

        To summarize, proposalOfDeath may not queued up when LearnerHandler receives IOException and other thread calling LearnerHandler.shutdown(). This leads to failure of queuing the proposalOfDeath.

        To solve this, can we add the proposalOfDeath in shutdown() rather in finally block.

        Anyways, this finding will solve one of the leaks. We still need to fix other leaks caused by ROZK.

        Just to see if I understand now, are you saying that the test is spawning over one hundred servers because we are shutting before the server actually starts? If so, then it is certainly a problem.

        Yes. I will add more analysis and possibly patch today.

        Show
        Laxman added a comment - Thread [Sender-/127.0.0.1:3084] (Running) When I observed QuorumZxidSyncTest, the above thread is also leaking in a different scenario apart from ReadOnlyZKServer. Above mentioned thread will be shutdown only if it receives proposalOfDeath. We are queuing proposalOfDeath in finally block of LearnerHandler.run(). To summarize, proposalOfDeath may not queued up when LearnerHandler receives IOException and other thread calling LearnerHandler.shutdown(). This leads to failure of queuing the proposalOfDeath. To solve this, can we add the proposalOfDeath in shutdown() rather in finally block. Anyways, this finding will solve one of the leaks. We still need to fix other leaks caused by ROZK. Just to see if I understand now, are you saying that the test is spawning over one hundred servers because we are shutting before the server actually starts? If so, then it is certainly a problem. Yes. I will add more analysis and possibly patch today.
        Hide
        Flavio Junqueira added a comment -

        Laxman, I might have misinterpreted your comment initially about the shutting down the server before it starts. Just to see if I understand now, are you saying that the test is spawning over one hundred servers because we are shutting before the server actually starts? If so, then it is certainly a problem.

        Show
        Flavio Junqueira added a comment - Laxman, I might have misinterpreted your comment initially about the shutting down the server before it starts. Just to see if I understand now, are you saying that the test is spawning over one hundred servers because we are shutting before the server actually starts? If so, then it is certainly a problem.
        Hide
        Patrick Hunt added a comment -

        Pat, I think you had some insight for why it should be off by default. Could you remind me what it is, please?

        We don't want to change the default behavior of the server during a minor release - if people upgrade they may not be expecting the server to drop into r/o mode automatically. New features such as this should be "off by default" with a configuration option to enable.

        Show
        Patrick Hunt added a comment - Pat, I think you had some insight for why it should be off by default. Could you remind me what it is, please? We don't want to change the default behavior of the server during a minor release - if people upgrade they may not be expecting the server to drop into r/o mode automatically. New features such as this should be "off by default" with a configuration option to enable.
        Hide
        Laxman added a comment -

        Reposting the comment as I'm not able to edit my previous comment.

        Thanks for the clarifications Flavio.
        I agree with your explanation in #1 and #2.

        shutdown() should have no effect if the zookeeper server hasn't started. Are you concerned that we might end up creating RO zookeeper servers and using them unnecessarily?

        Yes, while debugging the flow I've noticed that many servers are getting spawned continuously.

        Show
        Laxman added a comment - Reposting the comment as I'm not able to edit my previous comment. Thanks for the clarifications Flavio. I agree with your explanation in #1 and #2. shutdown() should have no effect if the zookeeper server hasn't started. Are you concerned that we might end up creating RO zookeeper servers and using them unnecessarily? Yes, while debugging the flow I've noticed that many servers are getting spawned continuously.
        Hide
        Laxman added a comment -

        Thanks for the clarifications Flavio.
        I agree with your explanation in #1 and #2.

        shutdown() should have no effect if the zookeeper server hasn't started. Are you concerned that we might end up creating RO zookeeper servers and using them unnecessarily?

        Show
        Laxman added a comment - Thanks for the clarifications Flavio. I agree with your explanation in #1 and #2. shutdown() should have no effect if the zookeeper server hasn't started. Are you concerned that we might end up creating RO zookeeper servers and using them unnecessarily?
        Hide
        Flavio Junqueira added a comment -

        Some comments:

        1. Waiting for some grace period to start the RO zookeeper server avoids switching back and forth frequently when connectivity is intermittent;
        2. Calling startup() in a new thread makes sense because otherwise we hold leader election until the grace period for RO expires. The alternative I see is to check for this grace period in the beginning of the run() method of RORequestProcessor;
        3. shutdown() should have no effect if the zookeeper server hasn't started. Are you concerned that we might end up creating RO zookeeper servers and using them unnecessarily?
        4. Pat, I think you had some insight for why it should be off by default. Could you remind me what it is, please?
        Show
        Flavio Junqueira added a comment - Some comments: Waiting for some grace period to start the RO zookeeper server avoids switching back and forth frequently when connectivity is intermittent; Calling startup() in a new thread makes sense because otherwise we hold leader election until the grace period for RO expires. The alternative I see is to check for this grace period in the beginning of the run() method of RORequestProcessor; shutdown() should have no effect if the zookeeper server hasn't started. Are you concerned that we might end up creating RO zookeeper servers and using them unnecessarily? Pat, I think you had some insight for why it should be off by default. Could you remind me what it is, please?
        Hide
        Patrick Hunt added a comment -

        Shouldn't R/O support only be enabled via an explicit configuration option? i.e. off by default.

        Show
        Patrick Hunt added a comment - Shouldn't R/O support only be enabled via an explicit configuration option? i.e. off by default.
        Hide
        Patrick Hunt added a comment -

        Comment on list from lakshman_ch:

        In QuorumPeer, when the peer is in LOOKING state we are starting
        ReadOnlyZooKeeperServer in a separate thread. And we are shutting down this
        server even before startup which has no effect. Also, as this is not a
        blocking call QP keeps on spawning new servers.

        1) ReadOnlyZooKeeperServer.startup() need not be called in separate a
        thread.
        2) ReadOnlyZooKeeperServer.startup() is not a blocking call. Need to
        introduce a method like Leader.lead(), Follower.followLeader()
        3) Shutdown should be called only after the a/m blocking call is returned.

        Show
        Patrick Hunt added a comment - Comment on list from lakshman_ch: In QuorumPeer, when the peer is in LOOKING state we are starting ReadOnlyZooKeeperServer in a separate thread. And we are shutting down this server even before startup which has no effect. Also, as this is not a blocking call QP keeps on spawning new servers. 1) ReadOnlyZooKeeperServer.startup() need not be called in separate a thread. 2) ReadOnlyZooKeeperServer.startup() is not a blocking call. Need to introduce a method like Leader.lead(), Follower.followLeader() 3) Shutdown should be called only after the a/m blocking call is returned.

          People

          • Assignee:
            Laxman
            Reporter:
            Patrick Hunt
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development