Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2247

Zookeeper service becomes unavailable when leader fails to write transaction log

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 3.4.9, 3.5.3, 3.6.0
    • Component/s: None
    • Labels:
      None

      Description

      Zookeeper service becomes unavailable when leader fails to write transaction log. Bellow are the exceptions

      2015-08-14 15:41:18,556 [myid:100] - ERROR [SyncThread:100:ZooKeeperCriticalThread@48] - Severe unrecoverable error, from thread : SyncThread:100
      java.io.IOException: Input/output error
      	at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
      	at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
      	at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:376)
      	at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:331)
      	at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:380)
      	at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:563)
      	at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:178)
      	at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:113)
      2015-08-14 15:41:18,559 [myid:100] - INFO  [SyncThread:100:ZooKeeperServer$ZooKeeperServerListenerImpl@500] - Thread SyncThread:100 exits, error code 1
      2015-08-14 15:41:18,559 [myid:100] - INFO  [SyncThread:100:ZooKeeperServer@523] - shutting down
      2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:SessionTrackerImpl@232] - Shutting down
      2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
      2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
      2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
      2015-08-14 15:41:18,561 [myid:100] - INFO  [SyncThread:100:CommitProcessor@356] - Shutting down
      2015-08-14 15:41:18,561 [myid:100] - INFO  [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
      2015-08-14 15:41:18,562 [myid:100] - INFO  [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
      2015-08-14 15:41:18,562 [myid:100] - INFO  [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor complete
      2015-08-14 15:41:18,562 [myid:100] - INFO  [SyncThread:100:SyncRequestProcessor@191] - Shutting down
      2015-08-14 15:41:18,563 [myid:100] - INFO  [ProcessThread(sid:100 cport:-1)::PrepRequestProcessor@159] - PrepRequestProcessor exited loop!
      

      After this exception Leader server still remains leader. After this non recoverable exception the leader should go down and let other followers become leader.

      1. ZOOKEEPER-2247-01.patch
        37 kB
        Mohammad Arshad
      2. ZOOKEEPER-2247-02.patch
        37 kB
        Mohammad Arshad
      3. ZOOKEEPER-2247-03.patch
        45 kB
        Mohammad Arshad
      4. ZOOKEEPER-2247-04.patch
        45 kB
        Mohammad Arshad
      5. ZOOKEEPER-2247-05.patch
        35 kB
        Mohammad Arshad
      6. ZOOKEEPER-2247-06.patch
        19 kB
        Mohammad Arshad
      7. ZOOKEEPER-2247-07.patch
        21 kB
        Rakesh R
      8. ZOOKEEPER-2247-09.patch
        21 kB
        Rakesh R
      9. ZOOKEEPER-2247-10.patch
        21 kB
        Rakesh R
      10. ZOOKEEPER-2247-11.patch
        21 kB
        Rakesh R
      11. ZOOKEEPER-2247-12.patch
        22 kB
        Rakesh R
      12. ZOOKEEPER-2247-13.patch
        24 kB
        Rakesh R
      13. ZOOKEEPER-2247-14.patch
        25 kB
        Rakesh R
      14. ZOOKEEPER-2247-15.patch
        24 kB
        Rakesh R
      15. ZOOKEEPER-2247-16.patch
        30 kB
        Rakesh R
      16. ZOOKEEPER-2247-17.patch
        28 kB
        Rakesh R
      17. ZOOKEEPER-2247-18.patch
        28 kB
        Rakesh R
      18. ZOOKEEPER-2247-19.patch
        28 kB
        Rakesh R
      19. ZOOKEEPER-2247-20.patch
        28 kB
        Rakesh R
      20. ZOOKEEPER-2247-21.patch
        29 kB
        Rakesh R
      21. ZOOKEEPER-2247-22.patch
        29 kB
        Rakesh R
      22. ZOOKEEPER-2247-23.patch
        29 kB
        Rakesh R
      23. ZOOKEEPER-2247-b3.5.patch
        12 kB
        Flavio Junqueira
      24. ZOOKEEPER-2247-br-3.4.patch
        30 kB
        Rakesh R
      25. ZOOKEEPER-2247-br-3.4.patch
        24 kB
        Rakesh R

        Issue Links

          Activity

          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          The problem is because quorum is not being shutdown along with other services. We should also call shutdown on org.apache.zookeeper.server.quorum.QuorumPeer.

          Show
          arshad.mohammad Mohammad Arshad added a comment - The problem is because quorum is not being shutdown along with other services. We should also call shutdown on org.apache.zookeeper.server.quorum.QuorumPeer .
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          I will soon submit a patch

          Show
          arshad.mohammad Mohammad Arshad added a comment - I will soon submit a patch
          Hide
          rakeshr Rakesh R added a comment -

          Yes, it should do quorumpeer#shutdown inorder to stop the service. Also, could you verify the behavior of standalone server.

          Show
          rakeshr Rakesh R added a comment - Yes, it should do quorumpeer#shutdown inorder to stop the service. Also, could you verify the behavior of standalone server.
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Standalone zookeeper server has the same behavior. After error neither it exists nor able to serve client request. I think in standalone case we should shutdown the process itself as it does not have option to switch to other role.

          Show
          arshad.mohammad Mohammad Arshad added a comment - Standalone zookeeper server has the same behavior. After error neither it exists nor able to serve client request. I think in standalone case we should shutdown the process itself as it does not have option to switch to other role.
          Hide
          fpj Flavio Junqueira added a comment -

          From the log lines in the description, it is shutting down, so I suppose you're saying that the same server keeps being re-elected? Do you what's wrong with your disk?

          Show
          fpj Flavio Junqueira added a comment - From the log lines in the description, it is shutting down, so I suppose you're saying that the same server keeps being re-elected? Do you what's wrong with your disk?
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Mohammad Arshad for the analysis. IMHO we should handle QuorumPeer and Standalone separately. One approach I'm thinking is, presently ZooKeeperServerListenerImpl is inside ZooKeeperServer.java we can move this to a separate java file.
          .

          • Standalone server can use this one, here instead of zks#shutdown() we should think of interrupting cnxnFactory and secureCnxnFactory and gracefully bring down the server.
          • QuorumPeer can do QuorumPeerListenerImpl extends ZooKeeperServerListenerImpl and stop respective observer/follower/leader service. This way will make the server to LOOKING state and allows to continue with the internal leader-election. This is similar to a situation where QuorumPeer hits unexpected exception and giving another chance to join quorum. Please refer below switch case logic, which is similar to the finally block of each QuorumPeer#L1055 server respectively.
            QuorumPeer#stopService(){
            	switch (getPeerState()) {
            		case OBSERVING:
                                   // stopObserver();
                                   observer.shutdown();
                                   setObserver(null);  
                                   updateServerState();
                                   break;
            		case FOLLOWING:
                                   // stopFollower();
                                   follower.shutdown();
                                   setFollower(null);
                                   updateServerState();
                                   break;
            		case LEADING:
                                   // stopLeader();
                                    if (leader != null) {
                                        leader.shutdown("Forcing shutdown");
                                        setLeader(null);
                                    }
                                    updateServerState();
                                   break;
            	}
            }
            
          Show
          rakeshr Rakesh R added a comment - Thanks Mohammad Arshad for the analysis. IMHO we should handle QuorumPeer and Standalone separately. One approach I'm thinking is, presently ZooKeeperServerListenerImpl is inside ZooKeeperServer.java we can move this to a separate java file. . Standalone server can use this one, here instead of zks#shutdown() we should think of interrupting cnxnFactory and secureCnxnFactory and gracefully bring down the server. QuorumPeer can do QuorumPeerListenerImpl extends ZooKeeperServerListenerImpl and stop respective observer/follower/leader service. This way will make the server to LOOKING state and allows to continue with the internal leader-election. This is similar to a situation where QuorumPeer hits unexpected exception and giving another chance to join quorum. Please refer below switch case logic, which is similar to the finally block of each QuorumPeer#L1055 server respectively. QuorumPeer#stopService(){ switch (getPeerState()) { case OBSERVING: // stopObserver(); observer.shutdown(); setObserver( null ); updateServerState(); break ; case FOLLOWING: // stopFollower(); follower.shutdown(); setFollower( null ); updateServerState(); break ; case LEADING: // stopLeader(); if (leader != null ) { leader.shutdown( "Forcing shutdown" ); setLeader( null ); } updateServerState(); break ; } }
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio Junqueira for the interest. I haven't seen your comment and was composing the below comment My understanding about the case is, presently ZooKeeperServer.java#L474 does quorumserver#shutdown(), but here it is leaving quorum peer state as running(Observer/Follower/Leader). This is not calling Observer/Follower/Leader#shutdown the service. Also, zks#shutdown is stopping ElectionAlg.

          Show
          rakeshr Rakesh R added a comment - Thanks Flavio Junqueira for the interest. I haven't seen your comment and was composing the below comment My understanding about the case is, presently ZooKeeperServer.java#L474 does quorumserver#shutdown(), but here it is leaving quorum peer state as running(Observer/Follower/Leader). This is not calling Observer/Follower/Leader#shutdown the service. Also, zks#shutdown is stopping ElectionAlg .
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          1) No, same server is not being being re-elected again-again. Re-election is not happening at all, this is because quorum not stopped.
          2) Yes, the disk suddenly becomes read-only

          Show
          arshad.mohammad Mohammad Arshad added a comment - 1) No, same server is not being being re-elected again-again. Re-election is not happening at all, this is because quorum not stopped. 2) Yes, the disk suddenly becomes read-only
          Hide
          rakeshr Rakesh R added a comment -

          Also, zks#shutdown is stopping ElectionAlg.

          This is wrong statement, please ignore it.

          Show
          rakeshr Rakesh R added a comment - Also, zks#shutdown is stopping ElectionAlg. This is wrong statement, please ignore it.
          Hide
          fpj Flavio Junqueira added a comment -

          Bear with me because it doesn't make sense to me, the description of the jira has this:

          2015-08-14 15:41:18,559 [myid:100] - INFO  [SyncThread:100:ZooKeeperServer@523] - shutting down
          2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:SessionTrackerImpl@232] - Shutting down
          2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:LeaderRequestProcessor@77] - Shutting down
          2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:PrepRequestProcessor@1035] - Shutting down
          2015-08-14 15:41:18,560 [myid:100] - INFO  [SyncThread:100:ProposalRequestProcessor@88] - Shutting down
          2015-08-14 15:41:18,561 [myid:100] - INFO  [SyncThread:100:CommitProcessor@356] - Shutting down
          2015-08-14 15:41:18,561 [myid:100] - INFO  [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop!
          2015-08-14 15:41:18,562 [myid:100] - INFO  [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down
          2015-08-14 15:41:18,562 [myid:100] - INFO  [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor complete
          2015-08-14 15:41:18,562 [myid:100] - INFO  [SyncThread:100:SyncRequestProcessor@191] - Shutting down
          

          This says that the server pipeline is shutting down!

          Show
          fpj Flavio Junqueira added a comment - Bear with me because it doesn't make sense to me, the description of the jira has this: 2015-08-14 15:41:18,559 [myid:100] - INFO [SyncThread:100:ZooKeeperServer@523] - shutting down 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:SessionTrackerImpl@232] - Shutting down 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:LeaderRequestProcessor@77] - Shutting down 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:PrepRequestProcessor@1035] - Shutting down 2015-08-14 15:41:18,560 [myid:100] - INFO [SyncThread:100:ProposalRequestProcessor@88] - Shutting down 2015-08-14 15:41:18,561 [myid:100] - INFO [SyncThread:100:CommitProcessor@356] - Shutting down 2015-08-14 15:41:18,561 [myid:100] - INFO [CommitProcessor:100:CommitProcessor@191] - CommitProcessor exited loop! 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:Leader$ToBeAppliedRequestProcessor@915] - Shutting down 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:FinalRequestProcessor@646] - shutdown of request processor complete 2015-08-14 15:41:18,562 [myid:100] - INFO [SyncThread:100:SyncRequestProcessor@191] - Shutting down This says that the server pipeline is shutting down!
          Hide
          fpj Flavio Junqueira added a comment -

          Also, I'm not sure how the disk can suddenly become read-only. It doesn't seem to be full from the exception.

          Show
          fpj Flavio Junqueira added a comment - Also, I'm not sure how the disk can suddenly become read-only. It doesn't seem to be full from the exception.
          Hide
          rakeshr Rakesh R added a comment -

          Let me try to add more details. ZOOKEEPER-1907 has added ZooKeeperServerListenerImpl to handle exceptions from ZK thread. On receiving exception, ZooKeeperServerListenerImpl will call zks#shutdown(). The above exception trace is showing the same. Like I mentioned earlier here it missed to stop the server state (Observer/Follower/Leader) fully, due to this it won't trigger re-election again and reforms the server pipeline. Probably Mohammad Arshad can share the unit tests to understand it well.

          Show
          rakeshr Rakesh R added a comment - Let me try to add more details. ZOOKEEPER-1907 has added ZooKeeperServerListenerImpl to handle exceptions from ZK thread. On receiving exception, ZooKeeperServerListenerImpl will call zks#shutdown(). The above exception trace is showing the same. Like I mentioned earlier here it missed to stop the server state (Observer/Follower/Leader) fully, due to this it won't trigger re-election again and reforms the server pipeline. Probably Mohammad Arshad can share the unit tests to understand it well.
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          In test environment, after the zookeeper server start, disk is made read-only using an internal tool which executes commands similar to mount -o ro,remount /dev/sda3. yes, it is not a disk full scenario

          Show
          arshad.mohammad Mohammad Arshad added a comment - In test environment, after the zookeeper server start, disk is made read-only using an internal tool which executes commands similar to mount -o ro,remount /dev/sda3 . yes, it is not a disk full scenario
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Test case are added as part of the ZOOKEEPER-2247-01.patch

          Show
          arshad.mohammad Mohammad Arshad added a comment - Test case are added as part of the ZOOKEEPER-2247 -01.patch
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12751287/ZOOKEEPER-2247-01.patch
          against trunk revision 1694317.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2830//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2830//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2830//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12751287/ZOOKEEPER-2247-01.patch against trunk revision 1694317. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2830//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2830//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2830//console This message is automatically generated.
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Fixed findbug issue which was reported in pre-commit check.

          Show
          arshad.mohammad Mohammad Arshad added a comment - Fixed findbug issue which was reported in pre-commit check.
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Generally, lgtm.

          A few nits:

          • In src/java/systest/org/apache/zookeeper/test/system/NonRecoverableErrorTests.java, testZookeeperSericeShouldBeAvailableEvenAfterNonRecoverableErrorOnLeader could probably just be named testZooKeeperServiceAvailableOnLeader (or something along that length...)
          • In TestUtils, there's a typo in deleteFileRecusively
          • also about TestUtils.deleteFileRecursively - if you grep for recursiveDelete you'll see that a few tests have their own (repeated) definition as well:
          src/java/systest/org/apache/zookeeper/test/system/QuorumPeerInstance.java
          src/java/test/org/apache/zookeeper/test/ClientBase.java
          src/java/test/org/apache/zookeeper/server/quorum/LearnerTest.java
          src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
          
          • and that these are used across multiple test files... Can we have everyone using TestUtils.deleteFileRecursively while we are at it? It would be easier to clean it now than to do it in another patch...
          • given that QuorumPeer.setConfigFileName is protected, maybe it's easier/cleaner to extend QuorumPeer and add a public setter for configFileName. The problem with using reflection if it breaks, we'd only know after running CI.. whereas otherwise compilation would fail and we'd know sooner.
          Show
          rgs Raul Gutierrez Segales added a comment - Generally, lgtm. A few nits: In src/java/systest/org/apache/zookeeper/test/system/NonRecoverableErrorTests.java, testZookeeperSericeShouldBeAvailableEvenAfterNonRecoverableErrorOnLeader could probably just be named testZooKeeperServiceAvailableOnLeader (or something along that length...) In TestUtils, there's a typo in deleteFileRecusively also about TestUtils.deleteFileRecursively - if you grep for recursiveDelete you'll see that a few tests have their own (repeated) definition as well: src/java/systest/org/apache/zookeeper/test/system/QuorumPeerInstance.java src/java/test/org/apache/zookeeper/test/ClientBase.java src/java/test/org/apache/zookeeper/server/quorum/LearnerTest.java src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java and that these are used across multiple test files... Can we have everyone using TestUtils.deleteFileRecursively while we are at it? It would be easier to clean it now than to do it in another patch... given that QuorumPeer.setConfigFileName is protected, maybe it's easier/cleaner to extend QuorumPeer and add a public setter for configFileName. The problem with using reflection if it breaks, we'd only know after running CI.. whereas otherwise compilation would fail and we'd know sooner.
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Thanks a lot Raul Gutierrez Segales for your review comments. Submitted new patch ZOOKEEPER-2247-03.patch with all your comments fixed except bellow

          The recursiveDelete method in src/java/test/org/apache/zookeeper/test/ClientBase.java has slightly different functionality from TestUtils.deleteFileRecursively method functionality.
          So can not replaced ClientBase.recursiveDelete with TestUtils.deleteFileRecursively

          Show
          arshad.mohammad Mohammad Arshad added a comment - Thanks a lot Raul Gutierrez Segales for your review comments. Submitted new patch ZOOKEEPER-2247 -03.patch with all your comments fixed except bellow The recursiveDelete method in src/java/test/org/apache/zookeeper/test/ClientBase.java has slightly different functionality from TestUtils.deleteFileRecursively method functionality. So can not replaced ClientBase.recursiveDelete with TestUtils.deleteFileRecursively
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          any idea why build is not triggered?

          Show
          arshad.mohammad Mohammad Arshad added a comment - any idea why build is not triggered?
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12752017/ZOOKEEPER-2247-04.patch
          against trunk revision 1697227.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 18 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2838//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2838//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2838//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12752017/ZOOKEEPER-2247-04.patch against trunk revision 1697227. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 18 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2838//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2838//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2838//console This message is automatically generated.
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          This patch is not reason for org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig test case failure. Verified the test case is passing locally

          Show
          arshad.mohammad Mohammad Arshad added a comment - This patch is not reason for org.apache.zookeeper.server.quorum.ReconfigRecoveryTest.testCurrentServersAreObserversInNextConfig test case failure. Verified the test case is passing locally
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Yeah, that one is flaky from time to time.

          cc: Alexander Shraer (in case you have some cycles to look at that failure case)

          Show
          rgs Raul Gutierrez Segales added a comment - Yeah, that one is flaky from time to time. cc: Alexander Shraer (in case you have some cycles to look at that failure case)
          Hide
          rgs Raul Gutierrez Segales added a comment -

          lgtm, +1 with a caveat: I am a bit worried about the getPrivateField/setPrivateField methods. I can't think of a cleaner approach now, but I suspect that doing it this is way isn't sustainable and might create issues in the future...

          Rakesh R: any ideas on a cleaner approach to get access to those private fields?

          Show
          rgs Raul Gutierrez Segales added a comment - lgtm, +1 with a caveat: I am a bit worried about the getPrivateField/setPrivateField methods. I can't think of a cleaner approach now, but I suspect that doing it this is way isn't sustainable and might create issues in the future... Rakesh R : any ideas on a cleaner approach to get access to those private fields?
          Hide
          rakeshr Rakesh R added a comment -

          I've few comments, please take a look at this:

          1. Please add log for the default: case. That would be helpful for debugging.
            QuorumPeer.java
            +        default:
            +            break;
            
          2. Please add time out parameter to the tests. @Test=600000 or something bigger calue depends on your case.
          3. Do we really need to create a new ConfigBaseSystemTest test class? Instead can we try reusing QuorumPeerTestBase.
          4. Few format issues, please correct these also.
            a)
            +    protected boolean isFakeMachines()
            +    {
            

            It should be

            +    protected boolean isFakeMachines() {
            

            b) Add one newline in between these two

            +    }
            +    public ZooKeeperServerListener getZooKeeperServerListener() {
            
          Show
          rakeshr Rakesh R added a comment - I've few comments, please take a look at this: Please add log for the default: case. That would be helpful for debugging. QuorumPeer.java + default : + break ; Please add time out parameter to the tests. @Test=600000 or something bigger calue depends on your case. Do we really need to create a new ConfigBaseSystemTest test class? Instead can we try reusing QuorumPeerTestBase . Few format issues, please correct these also. a) + protected boolean isFakeMachines() + { It should be + protected boolean isFakeMachines() { b) Add one newline in between these two + } + public ZooKeeperServerListener getZooKeeperServerListener() {
          Hide
          rakeshr Rakesh R added a comment -

          Rakesh R: any ideas on a cleaner approach to get access to those private fields?

          Raul Gutierrez Segales Thats really good point. I also feel this can be avoided. There is one possible way - Mock the ZKDatabase with your fileTxnSnapLogWithError and set this MockZKDatabase back to the ZooKeeperServer#setZKDatabase(mockDB). Please find the below sequence for simulating the case:

          step-1) 
          class MockZKDatabase extends ZKDatabase {
                  MockZKDatabase(FileTxnSnapLog fileTxnSnapLogWithError) {
                      super(snapLog);
                  }
          }
          
          step-2)
          MockZKDatabase mockDB = new MockZKDatabase(fileTxnSnapLogWithError);
          QuorumPeer leader = getLeaderQuorumPeer();
          leader.getActiveServer().setZKDatabase(mockDB);
          
          step-3)
          Execute your test scenario and see the execution is flowing through fileTxnSnapLogWithError and failing:)
          
          Show
          rakeshr Rakesh R added a comment - Rakesh R: any ideas on a cleaner approach to get access to those private fields? Raul Gutierrez Segales Thats really good point. I also feel this can be avoided. There is one possible way - Mock the ZKDatabase with your fileTxnSnapLogWithError and set this MockZKDatabase back to the ZooKeeperServer#setZKDatabase(mockDB). Please find the below sequence for simulating the case: step-1) class MockZKDatabase extends ZKDatabase { MockZKDatabase(FileTxnSnapLog fileTxnSnapLogWithError) { super (snapLog); } } step-2) MockZKDatabase mockDB = new MockZKDatabase(fileTxnSnapLogWithError); QuorumPeer leader = getLeaderQuorumPeer(); leader.getActiveServer().setZKDatabase(mockDB); step-3) Execute your test scenario and see the execution is flowing through fileTxnSnapLogWithError and failing:)
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Thanks Rakesh R for your suggestion. I was using private accessors to get FileTxnSnapLog but this I can get even from ZooKeeperServer. So actually neither private accessor nor mock is required. Please find the fix in ZOOKEEPER-2247-05.patch

          Show
          arshad.mohammad Mohammad Arshad added a comment - Thanks Rakesh R for your suggestion. I was using private accessors to get FileTxnSnapLog but this I can get even from ZooKeeperServer . So actually neither private accessor nor mock is required. Please find the fix in ZOOKEEPER-2247 -05.patch
          Hide
          arshad.mohammad Mohammad Arshad added a comment -
          1. fixed
          2. fixed
          3. fixed, ConfigBaseSystemTest not necessary for this patch. Since this patch is not testing any configuration related issue, I used BaseSysTest)) instead of {{QuorumPeerTestBase
          4. fixed
          5. fixed
          Show
          arshad.mohammad Mohammad Arshad added a comment - fixed fixed fixed, ConfigBaseSystemTest not necessary for this patch. Since this patch is not testing any configuration related issue, I used BaseSysTest)) instead of {{QuorumPeerTestBase fixed fixed
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12752256/ZOOKEEPER-2247-05.patch
          against trunk revision 1697551.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 20 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2841//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2841//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2841//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12752256/ZOOKEEPER-2247-05.patch against trunk revision 1697551. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 20 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2841//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2841//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2841//console This message is automatically generated.
          Hide
          beeflyme caixiaofeng added a comment -

          mark

          Show
          beeflyme caixiaofeng added a comment - mark
          Hide
          rakeshr Rakesh R added a comment -

          Good work Mohammad Arshad, thanks for taking care this. Overall patch looks nice, just few more comments, please take a look at it.

          1. Could you please try to use existing CountdownWatcher#waitForConnected() function rather than SimpleSysTest#waitForConnect. Reference RemoveWatchesTest.java#L770
          2. I could see there are few improvements TestUtils.deleteFileRecursively done along with this jira, its really great work and I appreciate your initiatives. But I think this is not related to this jira. I'd suggest to push these changes through separate jira. IMHO, that would help the reviewers focusing only on the changes(minimal code changes) needed to handle the defect scenario and resolve the defect as early as possible.
          3. Generally constant sleeping in test case is problematic. One alternative approach I can quickly suggest is to use sliced sleeping in a while loop and check for new LE occurred. Each slice could be 500 milliseconds window and while loop can iterate/counter 20times or 30times before failing, probably would give overall bigger timeout.
            +        
            +        // give enough time, so that new leader election takes place
            +        Thread.sleep(5000);
            
          4. I hope the below exception is expected. In that case, it would be good to add Assert.fail("Must throw exception as____") statement before catching the exception. That would make the test more better.
            +        try {
            +            // do create operation, so that injected IOException is thrown
            +            zk.create(uniqueNode(), data.getBytes(), Ids.OPEN_ACL_UNSAFE,
            +                    CreateMode.PERSISTENT);
            //////////Include Assert.fail("Must throw exception as____") statement 
            +        } catch (Exception e) {
            +            /**
            +             * intended IOException might have already thrown. so at this point
            +             * of time connection may not be available
            +             */
            +            e.printStackTrace();
            +        }
            
          5. minor suggestion: I'd prefer to add assert message in the unit test for better readability/debugging. Please add assertion messages.
            assertNotNull(leader);
            assertNotNull(currentleader);
            assertEquals(uniqueNode, createNode);
            
          Show
          rakeshr Rakesh R added a comment - Good work Mohammad Arshad , thanks for taking care this. Overall patch looks nice, just few more comments, please take a look at it. Could you please try to use existing CountdownWatcher#waitForConnected() function rather than SimpleSysTest#waitForConnect. Reference RemoveWatchesTest.java#L770 I could see there are few improvements TestUtils.deleteFileRecursively done along with this jira, its really great work and I appreciate your initiatives. But I think this is not related to this jira. I'd suggest to push these changes through separate jira. IMHO, that would help the reviewers focusing only on the changes(minimal code changes) needed to handle the defect scenario and resolve the defect as early as possible. Generally constant sleeping in test case is problematic. One alternative approach I can quickly suggest is to use sliced sleeping in a while loop and check for new LE occurred. Each slice could be 500 milliseconds window and while loop can iterate/counter 20times or 30times before failing, probably would give overall bigger timeout. + + // give enough time, so that new leader election takes place + Thread .sleep(5000); I hope the below exception is expected. In that case, it would be good to add Assert.fail("Must throw exception as____") statement before catching the exception. That would make the test more better. + try { + // do create operation, so that injected IOException is thrown + zk.create(uniqueNode(), data.getBytes(), Ids.OPEN_ACL_UNSAFE, + CreateMode.PERSISTENT); //////////Include Assert.fail( "Must throw exception as____" ) statement + } catch (Exception e) { + /** + * intended IOException might have already thrown. so at this point + * of time connection may not be available + */ + e.printStackTrace(); + } minor suggestion: I'd prefer to add assert message in the unit test for better readability/debugging. Please add assertion messages. assertNotNull(leader); assertNotNull(currentleader); assertEquals(uniqueNode, createNode);
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Thanks Rakesh R for your review comments.

          1. Changed test case base class to QuorumPeerTestBase which makes it easier to handle above comments. handled all comments.
          2. Created jira ZOOKEEPER-2306 for cleaning the test classes.
          Show
          arshad.mohammad Mohammad Arshad added a comment - Thanks Rakesh R for your review comments. Changed test case base class to QuorumPeerTestBase which makes it easier to handle above comments. handled all comments. Created jira ZOOKEEPER-2306 for cleaning the test classes.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12769223/ZOOKEEPER-2247-06.patch
          against trunk revision 1709293.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2932//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2932//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2932//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12769223/ZOOKEEPER-2247-06.patch against trunk revision 1709293. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2932//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2932//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2932//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Mohammad Arshad for the updates.

          Raul Gutierrez Segales, Flavio Junqueira do you also want to take another look at the latest patch since you had reviewed this earlier. Thanks!

          Show
          rakeshr Rakesh R added a comment - Thanks Mohammad Arshad for the updates. Raul Gutierrez Segales , Flavio Junqueira do you also want to take another look at the latest patch since you had reviewed this earlier. Thanks!
          Hide
          fpj Flavio Junqueira added a comment -

          Can we have the patch here somewhere for review, the review board or even github as a pull request? There are a few small things I want to point and it is easier with a review tool.

          One high-level point I wanted to raise is that I don't really like the separation between standalone and quorum member for the listener implementation. If the way we are shutting down servers and request processors needs adjustment, then lest's do it, but creating workarounds just makes it messier. For the standalone case, we are shutting down it in ZooKeeperServerMain, and I think the latest patch here is just trying to replicate how we shut down the server in ZooKeeperServerMain, which isn't good because it is duplicating code. We need to make this better before checking in and ideally not have two listener implementations.

          Show
          fpj Flavio Junqueira added a comment - Can we have the patch here somewhere for review, the review board or even github as a pull request? There are a few small things I want to point and it is easier with a review tool. One high-level point I wanted to raise is that I don't really like the separation between standalone and quorum member for the listener implementation. If the way we are shutting down servers and request processors needs adjustment, then lest's do it, but creating workarounds just makes it messier. For the standalone case, we are shutting down it in ZooKeeperServerMain, and I think the latest patch here is just trying to replicate how we shut down the server in ZooKeeperServerMain, which isn't good because it is duplicating code. We need to make this better before checking in and ideally not have two listener implementations.
          Hide
          rakeshr Rakesh R added a comment -
          Show
          rakeshr Rakesh R added a comment - ping Mohammad Arshad
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12769223/ZOOKEEPER-2247-06.patch
          against trunk revision 1720227.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 5 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3007//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3007//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3007//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12769223/ZOOKEEPER-2247-06.patch against trunk revision 1720227. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 5 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3007//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3007//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3007//console This message is automatically generated.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user arshadmohammad opened a pull request:

          https://github.com/apache/zookeeper/pull/52

          ZOOKEEPER-2247 Zookeeper service becomes unavailable when leader fail…

          …s to write transaction log

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/arshadmohammad/zookeeper ZOOKEEPER-2247

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/zookeeper/pull/52.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #52


          commit d0e8f91080a11a0be8031aaca2d6f2b631b0be17
          Author: Arshad Mohammad <arshad.mohammad.k@gmail.com>
          Date: 2016-01-18T03:20:13Z

          ZOOKEEPER-2247 Zookeeper service becomes unavailable when leader fails to write transaction log


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user arshadmohammad opened a pull request: https://github.com/apache/zookeeper/pull/52 ZOOKEEPER-2247 Zookeeper service becomes unavailable when leader fail… …s to write transaction log You can merge this pull request into a Git repository by running: $ git pull https://github.com/arshadmohammad/zookeeper ZOOKEEPER-2247 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/52.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #52 commit d0e8f91080a11a0be8031aaca2d6f2b631b0be17 Author: Arshad Mohammad <arshad.mohammad.k@gmail.com> Date: 2016-01-18T03:20:13Z ZOOKEEPER-2247 Zookeeper service becomes unavailable when leader fails to write transaction log
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Mohammad Arshad: thanks for the pull request! Patch lgtm, +1. I did leave some nits on GH... Other than that, I'd like to have this for 3.4.8 if others agree.

          Flavio Junqueira, Rakesh R, Chris Nauroth: mind taking another look?

          Show
          rgs Raul Gutierrez Segales added a comment - Mohammad Arshad : thanks for the pull request! Patch lgtm, +1. I did leave some nits on GH... Other than that, I'd like to have this for 3.4.8 if others agree. Flavio Junqueira , Rakesh R , Chris Nauroth : mind taking another look?
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Thanks Raul Gutierrez Segales for your comments, I handled it at GH

          Show
          arshad.mohammad Mohammad Arshad added a comment - Thanks Raul Gutierrez Segales for your comments, I handled it at GH
          Hide
          fpj Flavio Junqueira added a comment -

          I can't convince myself that this is the right way to fix this issue. In fact, the critical thread addition might not have been done in the way I'd expect. If you check the run method in QuorumPeer, around line 1093, I'd expect that we deal with the problem reported here by simply returning from observeLeader(), followLeader(), and lead(). Instead, it sounds like the handleException() call is taking a different call path and the finally blocks for observeLeader(), followLeader(), and lead() aren't being executed, which would have avoided this present issue in the first place. Could anyone explain to me why we aren't simply relying on the finally blocks? If we can do it, I'd much rather have this option implemented rather than multiple code paths that change the state of the server.

          Show
          fpj Flavio Junqueira added a comment - I can't convince myself that this is the right way to fix this issue. In fact, the critical thread addition might not have been done in the way I'd expect. If you check the run method in QuorumPeer, around line 1093, I'd expect that we deal with the problem reported here by simply returning from observeLeader() , followLeader() , and lead() . Instead, it sounds like the handleException() call is taking a different call path and the finally blocks for observeLeader() , followLeader() , and lead() aren't being executed, which would have avoided this present issue in the first place. Could anyone explain to me why we aren't simply relying on the finally blocks? If we can do it, I'd much rather have this option implemented rather than multiple code paths that change the state of the server.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio for pointing out the multiple execution paths.

          Could anyone explain to me why we aren't simply relying on the finally blocks?

          When there is an uncaught exception thrown by any of the internal critical threads, QuourmPeer doesn't have any mechanism to know that internal error state. He still continue with the #readPacket(). For example, Follower.java#L88 will continue reading without knowing that error. To execute the finally blocks there should be a way to stop this reading logic. So as part of ZOOKEEPER-1907 design discussions, the point has come up to introduce a listening mechanism which will take action and gracefully bring down the QuourmPeer. This made another execution path that change the state of the server.

          If we can do it, I'd much rather have this option implemented rather than multiple code paths that change the state of the server.

          I understand your point. How about introducing a polling mechanism at QuorumPeer. Presently ZooKeeperServerListener is taking the decision to shutdown the server, instead of this ZooKeeperServerListener will just mark the internal error state only. Later while polling QuorumPeer will see this error and exits the loop gracefully.

          The idea is something like, ZooKeeper server will maintain an internalErrorState, which will be then used by the QuorumPeer while reading the packet. If QuorumPeer sees an error then will break and executes the finally block. On the other side, all the threads will use ZooKeeperServerListener. He will listen the unexpected errors and notify the QuourmPeer about that error by setting zk.setInternalErrorState(true) to true.

          QuourmPeer should have a logic like,

              while (self.isRunning() && !zk.hasInternalError()) {
                  readPacket(qp);
                  processPacket(qp);
              }
          

          Similar polling mechanism has to be introduced at the standalone server ZooKeeperServerMain.java#L149 as well.

          I don't think we need to worry about the other internal exceptions which can occur before the ZK server enters into the #readPacket() state Follower.java#L88. I hope all these errors will come out and stops the server gracefully. Please correct me if I'm missing any other cases.

          Show
          rakeshr Rakesh R added a comment - Thanks Flavio for pointing out the multiple execution paths. Could anyone explain to me why we aren't simply relying on the finally blocks? When there is an uncaught exception thrown by any of the internal critical threads, QuourmPeer doesn't have any mechanism to know that internal error state. He still continue with the #readPacket(). For example, Follower.java#L88 will continue reading without knowing that error. To execute the finally blocks there should be a way to stop this reading logic. So as part of ZOOKEEPER-1907 design discussions, the point has come up to introduce a listening mechanism which will take action and gracefully bring down the QuourmPeer. This made another execution path that change the state of the server. If we can do it, I'd much rather have this option implemented rather than multiple code paths that change the state of the server. I understand your point. How about introducing a polling mechanism at QuorumPeer. Presently ZooKeeperServerListener is taking the decision to shutdown the server, instead of this ZooKeeperServerListener will just mark the internal error state only. Later while polling QuorumPeer will see this error and exits the loop gracefully. The idea is something like, ZooKeeper server will maintain an internalErrorState , which will be then used by the QuorumPeer while reading the packet. If QuorumPeer sees an error then will break and executes the finally block. On the other side, all the threads will use ZooKeeperServerListener. He will listen the unexpected errors and notify the QuourmPeer about that error by setting zk.setInternalErrorState(true) to true. QuourmPeer should have a logic like, while (self.isRunning() && !zk.hasInternalError()) { readPacket(qp); processPacket(qp); } Similar polling mechanism has to be introduced at the standalone server ZooKeeperServerMain.java#L149 as well. I don't think we need to worry about the other internal exceptions which can occur before the ZK server enters into the #readPacket() state Follower.java#L88 . I hope all these errors will come out and stops the server gracefully. Please correct me if I'm missing any other cases.
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R It sounds ok to add to the predicate a call to !zk.hasInternalError() as you propose, but why can't we simply make self.isRunning() return false in the case of an error by setting running to false? That's what we want, that the server stops running in the case of an error, right?

          QuorumPeer.isRunning() returns the value of QuorumPeer.running, which is the condition to keep running the main loop, so we don't want to set it to false. It sounds like using QuorumPeer.isRunning() as is with follower, observer, learner, and leader isn't great because there are scenarios (like the one discussed here) in which we want to shutdown a participant/observer, but not the quorum peer. We may want to have a isRunning() for the follower, observer, learner, and leader classes that returns something like running && !zk.hasInternalError(). We may need to implement a isRunning() method for each one of those classes because they might eventually have different predicates to determine whether they are running or not.

          Does it make sense?

          Show
          fpj Flavio Junqueira added a comment - Rakesh R It sounds ok to add to the predicate a call to !zk.hasInternalError() as you propose, but why can't we simply make self.isRunning() return false in the case of an error by setting running to false? That's what we want, that the server stops running in the case of an error, right? QuorumPeer.isRunning() returns the value of QuorumPeer.running , which is the condition to keep running the main loop, so we don't want to set it to false. It sounds like using QuorumPeer.isRunning() as is with follower, observer, learner, and leader isn't great because there are scenarios (like the one discussed here) in which we want to shutdown a participant/observer, but not the quorum peer. We may want to have a isRunning() for the follower, observer, learner, and leader classes that returns something like running && !zk.hasInternalError() . We may need to implement a isRunning() method for each one of those classes because they might eventually have different predicates to determine whether they are running or not. Does it make sense?
          Hide
          rakeshr Rakesh R added a comment -

          The idea looks good. Agreed.

          I'm thinking about standalone/embedded server to implement zks.hasInternalError() logic. Here server waiting for thread.join() ZooKeeperServerMain.java#L149. Probably we may need to introduce another monitor thread to watch zkserver errors and interrupt the waiting threads. Whats your opinion?

          Mohammad Arshad Would you mind preparing a patch based on the above discussions.

          Show
          rakeshr Rakesh R added a comment - The idea looks good. Agreed. I'm thinking about standalone/embedded server to implement zks.hasInternalError() logic. Here server waiting for thread.join() ZooKeeperServerMain.java#L149 . Probably we may need to introduce another monitor thread to watch zkserver errors and interrupt the waiting threads. Whats your opinion? Mohammad Arshad Would you mind preparing a patch based on the above discussions.
          Hide
          fpj Flavio Junqueira added a comment -

          Guys, we need to wrap up 3.4.8, if we can't get a patch ready by the end of this week, I'd suggest we leave for the next release.

          Show
          fpj Flavio Junqueira added a comment - Guys, we need to wrap up 3.4.8, if we can't get a patch ready by the end of this week, I'd suggest we leave for the next release.
          Hide
          rakeshr Rakesh R added a comment -

          Flavio Junqueira Agreed. I've tried an attempt based on the above discussions. Please take a look at the latest patch. Thanks!

          Following are the changes:

          • Added #isRunning() at Learner & Leader
          • Added internalError flag at ZooKeeperServer
          • Added HealthMonitorThread in ZooKeeperServerMain. I think, for supporting embedded deployment, we may need to move this to ZooKeeperServer, right?
          Show
          rakeshr Rakesh R added a comment - Flavio Junqueira Agreed. I've tried an attempt based on the above discussions. Please take a look at the latest patch. Thanks! Following are the changes: Added #isRunning() at Learner & Leader Added internalError flag at ZooKeeperServer Added HealthMonitorThread in ZooKeeperServerMain. I think, for supporting embedded deployment, we may need to move this to ZooKeeperServer, right?
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12785089/ZOOKEEPER-2247-07.patch
          against trunk revision 1726354.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3018//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3018//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3018//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12785089/ZOOKEEPER-2247-07.patch against trunk revision 1726354. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3018//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3018//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3018//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12785133/ZOOKEEPER-2247-09.patch
          against trunk revision 1726354.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3019//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3019//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3019//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12785133/ZOOKEEPER-2247-09.patch against trunk revision 1726354. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3019//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3019//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3019//console This message is automatically generated.
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Per discussion in the mailing list, lets punt this to 3.4.9.

          cc: Flavio Junqueira

          Show
          rgs Raul Gutierrez Segales added a comment - Per discussion in the mailing list, lets punt this to 3.4.9. cc: Flavio Junqueira
          Hide
          cnauroth Chris Nauroth added a comment -

          In the latest patch, I only see the new HeartbeatMonitor thread started in standalone mode, because it's coupled with ZooKeeperServerMain. In the case of a full ensemble, QuorumPeerMain won't start the thread, because it won't call ZooKeeperServerMain (unless it's the degenerate case of a single-node ensemble).

          I've been trying to look for a way to solve this without adding another watchdog thread, but unfortunately I don't have another proposal to offer right now.

          Show
          cnauroth Chris Nauroth added a comment - In the latest patch, I only see the new HeartbeatMonitor thread started in standalone mode, because it's coupled with ZooKeeperServerMain . In the case of a full ensemble, QuorumPeerMain won't start the thread, because it won't call ZooKeeperServerMain (unless it's the degenerate case of a single-node ensemble). I've been trying to look for a way to solve this without adding another watchdog thread, but unfortunately I don't have another proposal to offer right now.
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R Thanks for updating the patch. There are two things I believe we can improve here:

          1. Instead of doing this while (self.isRunning() && this.isRunning()), why don't you do this while (this.isRunning()) and check self.running() in this.isRunning()?
          2. I don't think we need the health monitor thread. It is just shutting down the cnxn factories and you could do it immediately after the server loops. For example, in Follower.java, add the cnxn factory shutdown calls after the {{while (this.isRunning()) {...}

            }}. Does it work?

          Show
          fpj Flavio Junqueira added a comment - Rakesh R Thanks for updating the patch. There are two things I believe we can improve here: Instead of doing this while (self.isRunning() && this.isRunning()) , why don't you do this while (this.isRunning()) and check self.running() in this.isRunning() ? I don't think we need the health monitor thread. It is just shutting down the cnxn factories and you could do it immediately after the server loops. For example, in Follower.java, add the cnxn factory shutdown calls after the {{while (this.isRunning()) {...} }}. Does it work?
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Chris Nauroth for the interest.

          In the latest patch, I only see the new HeartbeatMonitor thread started in standalone mode, because it's coupled with ZooKeeperServerMain. In the case of a full ensemble, QuorumPeerMain won't start the thread, because it won't call ZooKeeperServerMain (unless it's the degenerate case of a single-node ensemble).

          Monitor thread is not required for the quorum as they already have a server loop where the internalError checks has been integrated. But in case of standalone there is no loops instead it has #join() like I mentioned earlier ZooKeeperServerMain.java#L149. Thats the reason I've added a watch dog there.

          Show
          rakeshr Rakesh R added a comment - Thanks Chris Nauroth for the interest. In the latest patch, I only see the new HeartbeatMonitor thread started in standalone mode, because it's coupled with ZooKeeperServerMain. In the case of a full ensemble, QuorumPeerMain won't start the thread, because it won't call ZooKeeperServerMain (unless it's the degenerate case of a single-node ensemble). Monitor thread is not required for the quorum as they already have a server loop where the internalError checks has been integrated. But in case of standalone there is no loops instead it has #join() like I mentioned earlier ZooKeeperServerMain.java#L149 . Thats the reason I've added a watch dog there.
          Hide
          rakeshr Rakesh R added a comment -

          Per discussion in the mailing list, lets punt this to 3.4.9.

          Thanks for the patience Raul Gutierrez Segales . If we get a good solution will include this, otw agreed to postpone.

          Thanks Flavio Junqueira for the review comments.

          Instead of doing this while (self.isRunning() && this.isRunning()), why don't you do this while (this.isRunning()) and check self.running() in this.isRunning()?

          Agreed, will change it.

          I don't think we need the health monitor thread. It is just shutting down the cnxn factories and you could do it immediately after the server loops. For example, in Follower.java, add the cnxn factory shutdown calls after the 'while (this.isRunning())'. Does it work?

          The solution works well with ensemble. IIUC, there is no server loop in the standalone server. He just waits on #join ZooKeeperServerMain.java#L149. I'm not finding a way to interrupt this without a watchdog in standalone mode. Could you please correct me if I'm missing anything. Thanks!

          Show
          rakeshr Rakesh R added a comment - Per discussion in the mailing list, lets punt this to 3.4.9. Thanks for the patience Raul Gutierrez Segales . If we get a good solution will include this, otw agreed to postpone. Thanks Flavio Junqueira for the review comments. Instead of doing this while (self.isRunning() && this.isRunning()), why don't you do this while (this.isRunning()) and check self.running() in this.isRunning()? Agreed, will change it. I don't think we need the health monitor thread. It is just shutting down the cnxn factories and you could do it immediately after the server loops. For example, in Follower.java, add the cnxn factory shutdown calls after the 'while (this.isRunning())'. Does it work? The solution works well with ensemble. IIUC, there is no server loop in the standalone server. He just waits on #join ZooKeeperServerMain.java#L149 . I'm not finding a way to interrupt this without a watchdog in standalone mode. Could you please correct me if I'm missing anything. Thanks!
          Hide
          rakeshr Rakesh R added a comment -

          Attached another patch addressing while (self.isRunning() && this.isRunning()) comment. Also, added checks in the Leader, that was missed previously.

          Show
          rakeshr Rakesh R added a comment - Attached another patch addressing while (self.isRunning() && this.isRunning()) comment. Also, added checks in the Leader , that was missed previously.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12785335/ZOOKEEPER-2247-10.patch
          against trunk revision 1726354.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3024//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3024//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3024//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12785335/ZOOKEEPER-2247-10.patch against trunk revision 1726354. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3024//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3024//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3024//console This message is automatically generated.
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Thanks Rakesh R!

          Mind taking one more look Flavio Junqueira?

          Show
          rgs Raul Gutierrez Segales added a comment - Thanks Rakesh R ! Mind taking one more look Flavio Junqueira ?
          Hide
          cnauroth Chris Nauroth added a comment -

          Rakesh R, thank you for your patient explanations while I play catch-up understanding the problem and the patch. I think I see it now.

          The solution works well with ensemble. IIUC, there is no server loop in the standalone server.

          Yes, I see the key difference now. Standalone mode puts the main thread straight into ServerCnxnFactory#join, which does not react to these internal errors and therefore does not shut down. Ensemble mode puts the main thread into QuorumPeer#join, and QuorumPeer#run has a polling loop that is equipped to react to these internal errors.

          What if ZooKeeperServerMain was changed so that the main thread was put into a polling loop, similar to QuorumPeer? Essentially, I think this means taking the current patch's HealthMonitor code and putting it inline with the main thread's execution of ZooKeeperServerMain#runFromConfig before it tries to join. Would that achieve the same effect without needing the extra watchdog thread?

          Show
          cnauroth Chris Nauroth added a comment - Rakesh R , thank you for your patient explanations while I play catch-up understanding the problem and the patch. I think I see it now. The solution works well with ensemble. IIUC, there is no server loop in the standalone server. Yes, I see the key difference now. Standalone mode puts the main thread straight into ServerCnxnFactory#join , which does not react to these internal errors and therefore does not shut down. Ensemble mode puts the main thread into QuorumPeer#join , and QuorumPeer#run has a polling loop that is equipped to react to these internal errors. What if ZooKeeperServerMain was changed so that the main thread was put into a polling loop, similar to QuorumPeer ? Essentially, I think this means taking the current patch's HealthMonitor code and putting it inline with the main thread's execution of ZooKeeperServerMain#runFromConfig before it tries to join. Would that achieve the same effect without needing the extra watchdog thread?
          Hide
          fpj Flavio Junqueira added a comment -

          I was thinking the same thing, +1 to Chris Nauroth suggestion.

          Show
          fpj Flavio Junqueira added a comment - I was thinking the same thing, +1 to Chris Nauroth suggestion.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks a lot Chris Nauroth for the good suggestion. Uploaded another patch with this change.

          Chris Nauroth, Flavio Junqueira. I'm just adding a thought to know any integration issues. Now, with the new proposed solution, it will wait in while loop by periodically checking the status of the ZooKeeper server and will not execute the ServerCnxnFactory#join() function. I hope any of the ZooKeeper customers haven't overridden the ServerCnxnFactory#join() method and added extra functionality there.

          public abstract class ServerCnxnFactory {
          
                    public abstract void join() throws InterruptedException;
          
          Show
          rakeshr Rakesh R added a comment - Thanks a lot Chris Nauroth for the good suggestion. Uploaded another patch with this change. Chris Nauroth , Flavio Junqueira . I'm just adding a thought to know any integration issues. Now, with the new proposed solution, it will wait in while loop by periodically checking the status of the ZooKeeper server and will not execute the ServerCnxnFactory#join() function. I hope any of the ZooKeeper customers haven't overridden the ServerCnxnFactory#join() method and added extra functionality there. public abstract class ServerCnxnFactory { public abstract void join() throws InterruptedException;
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12785683/ZOOKEEPER-2247-11.patch
          against trunk revision 1726354.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3025//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3025//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3025//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12785683/ZOOKEEPER-2247-11.patch against trunk revision 1726354. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3025//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3025//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3025//console This message is automatically generated.
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R what about the changes in the patch I'm uploading? Please keep in mind that we'll need a patch for 3.4, I'm assuming the patch for 3.5 will also apply to trunk and if not we'll need a third patch.

          Show
          fpj Flavio Junqueira added a comment - Rakesh R what about the changes in the patch I'm uploading? Please keep in mind that we'll need a patch for 3.4, I'm assuming the patch for 3.5 will also apply to trunk and if not we'll need a third patch.
          Hide
          rakeshr Rakesh R added a comment -

          OK, looks good. NonRecoverableErrorTest.java is missing, can we include this unit test to verify the internal error behavior of quorum?

          Show
          rakeshr Rakesh R added a comment - OK, looks good. NonRecoverableErrorTest.java is missing, can we include this unit test to verify the internal error behavior of quorum?
          Hide
          rakeshr Rakesh R added a comment -

          Adding one more observation:

          ZooKeeperServer.java
               public boolean isRunning() {
          -        return state == State.RUNNING;
          +        return (state == State.RUNNING) && !this.hasInternalError();
               }
          

          I'm thinking the above changes will affect the zks#shutdown() logic. Many places I could see it is skipping the shutdown logic by doing !isRunning() checks. Now, once the server hits an internal error, it will set the internalError=true flag. After this when shutting down it will skip the shutdown thinking server is already down, right?

          Existing shutdown logic like,
          LearnerZooKeeperServer.java#L161, ObserverZooKeeperServer.java#L135, ReadOnlyZooKeeperServer.java#L140

          ZooKeeperServer.java
          
              public synchronized void shutdown() {
                  if (!isRunning()) {
                      LOG.debug("ZooKeeper server is not running, so not proceeding to shutdown!");
                      return;
                  }
                  LOG.info("shutting down");
          
          Show
          rakeshr Rakesh R added a comment - Adding one more observation: ZooKeeperServer.java public boolean isRunning() { - return state == State.RUNNING; + return (state == State.RUNNING) && ! this .hasInternalError(); } I'm thinking the above changes will affect the zks#shutdown() logic. Many places I could see it is skipping the shutdown logic by doing !isRunning() checks. Now, once the server hits an internal error, it will set the internalError=true flag. After this when shutting down it will skip the shutdown thinking server is already down, right? Existing shutdown logic like, LearnerZooKeeperServer.java#L161 , ObserverZooKeeperServer.java#L135 , ReadOnlyZooKeeperServer.java#L140 ZooKeeperServer.java public synchronized void shutdown() { if (!isRunning()) { LOG.debug( "ZooKeeper server is not running, so not proceeding to shutdown!" ); return ; } LOG.info( "shutting down" );
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R I wasn't proposing a final patch, just a change to your patch. I might have missed a file, so feel free to incorporate the changes to your patch. I'd rather have you proposing it so that I can review it.

          Show
          fpj Flavio Junqueira added a comment - Rakesh R I wasn't proposing a final patch, just a change to your patch. I might have missed a file, so feel free to incorporate the changes to your patch. I'd rather have you proposing it so that I can review it.
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Rakesh R, Flavio Junqueira: could we wrap this today please? Thanks!

          Show
          rgs Raul Gutierrez Segales added a comment - Rakesh R , Flavio Junqueira : could we wrap this today please? Thanks!
          Hide
          fpj Flavio Junqueira added a comment -

          You're absolutely right Rakesh R, it changes the behavior so we need to fix it.

          Here is my rationale. ZooKeeperServer.isRunning() should return true if the server is running. If there has been an error that made the server stop, then it isn't running, even if the state is RUNNING. There are a couple of options I see to fix this:

          1. We add a new state ERROR, which means that the server is in this limbo state, it isn't shut down but came across an internal error that made it stop. If the server is in this state, then we proceed with the shutdown logic you mention above. We would make the server transition to this state when we hit an error, and if we do it, then I think we don't need the hasInternalError() call any longer.
          2. We add a call like isStateRunning(), which is basically return state == State.RUNNING. If we do this, then we are essentially saying that isRunning() determines whether the server is running or not by checking the state and the internal error flag, while isStateRunning() simply determines whether the state of the server is State.RUNNING. We replace the if(!isRunning() in the code you mentioned above with if(!isStateRunning()).

          Option 1 sounds cleaner to me, but happy to hear opinions.

          Show
          fpj Flavio Junqueira added a comment - You're absolutely right Rakesh R , it changes the behavior so we need to fix it. Here is my rationale. ZooKeeperServer.isRunning() should return true if the server is running. If there has been an error that made the server stop, then it isn't running, even if the state is RUNNING . There are a couple of options I see to fix this: We add a new state ERROR , which means that the server is in this limbo state, it isn't shut down but came across an internal error that made it stop. If the server is in this state, then we proceed with the shutdown logic you mention above. We would make the server transition to this state when we hit an error, and if we do it, then I think we don't need the hasInternalError() call any longer. We add a call like isStateRunning() , which is basically return state == State.RUNNING . If we do this, then we are essentially saying that isRunning() determines whether the server is running or not by checking the state and the internal error flag, while isStateRunning() simply determines whether the state of the server is State.RUNNING . We replace the if(!isRunning() in the code you mentioned above with if(!isStateRunning()) . Option 1 sounds cleaner to me, but happy to hear opinions.
          Hide
          cnauroth Chris Nauroth added a comment -

          I hope any of the ZooKeeper customers haven't overridden the ServerCnxnFactory#join() method and added extra functionality there.

          I would find it very surprising if anyone was doing this. I don't believe this can be considered part of a public stable API. We don't publish the JavaDocs for it.

          Option 1 sounds cleaner to me, but happy to hear opinions.

          I agree. I think adding an ERROR state would model the problem more clearly.

          Show
          cnauroth Chris Nauroth added a comment - I hope any of the ZooKeeper customers haven't overridden the ServerCnxnFactory#join() method and added extra functionality there. I would find it very surprising if anyone was doing this. I don't believe this can be considered part of a public stable API. We don't publish the JavaDocs for it. Option 1 sounds cleaner to me, but happy to hear opinions. I agree. I think adding an ERROR state would model the problem more clearly.
          Hide
          rakeshr Rakesh R added a comment -

          I would find it very surprising if anyone was doing this. I don't believe this can be considered part of a public stable API. We don't publish the JavaDocs for it.

          Yes, you are correct. This is not published in the javadocs.

          Show
          rakeshr Rakesh R added a comment - I would find it very surprising if anyone was doing this. I don't believe this can be considered part of a public stable API. We don't publish the JavaDocs for it. Yes, you are correct. This is not published in the javadocs.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio Junqueira, Chris Nauroth for the idea. Attached another patch with the new server state ERROR. Please take another look at the patch.

              public boolean isRunning() {
                  return state == State.RUNNING || state == State.ERROR;
              }
          
              public boolean isStateRunning() {
                  return state == State.RUNNING;
              }
          
          Show
          rakeshr Rakesh R added a comment - Thanks Flavio Junqueira , Chris Nauroth for the idea. Attached another patch with the new server state ERROR . Please take another look at the patch. public boolean isRunning() { return state == State.RUNNING || state == State.ERROR; } public boolean isStateRunning() { return state == State.RUNNING; }
          Hide
          rakeshr Rakesh R added a comment -

          OK

          Show
          rakeshr Rakesh R added a comment - OK
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12785927/ZOOKEEPER-2247-12.patch
          against trunk revision 1726354.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3026//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3026//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3026//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12785927/ZOOKEEPER-2247-12.patch against trunk revision 1726354. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3026//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3026//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3026//console This message is automatically generated.
          Hide
          cnauroth Chris Nauroth added a comment -

          Rakesh R, patch v12 looks good to me. I just have one comment.

                      // Watch status of ZooKeeper server. If there is an internal error
                      // then will do a graceful shutdown.
                      while (zkServer.isStateRunning()) {
                          try {
                              Thread.sleep(1000); // watch interval
                          } catch (InterruptedException ie) {
                              LOG.info("Thread interrupted");
                          }
                      }
          

          It's generally an anti-pattern to swallow InterruptedException, even though there is a lot of existing code in ZooKeeper and other codebases that does it. In this specific case, it would clear the interrupted status, and then that could potentially impact later code like ServerCnxnFactory#join that calls interruptable methods. Let's restore interrupted status in the catch block by calling Thread.currentThread().interrupt().

          I'll be +1 after that change. Thanks again for your diligence on this one!

          Show
          cnauroth Chris Nauroth added a comment - Rakesh R , patch v12 looks good to me. I just have one comment. // Watch status of ZooKeeper server. If there is an internal error // then will do a graceful shutdown. while (zkServer.isStateRunning()) { try { Thread .sleep(1000); // watch interval } catch (InterruptedException ie) { LOG.info( " Thread interrupted" ); } } It's generally an anti-pattern to swallow InterruptedException , even though there is a lot of existing code in ZooKeeper and other codebases that does it. In this specific case, it would clear the interrupted status, and then that could potentially impact later code like ServerCnxnFactory#join that calls interruptable methods. Let's restore interrupted status in the catch block by calling Thread.currentThread().interrupt() . I'll be +1 after that change. Thanks again for your diligence on this one!
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R One simple thing I'd like to have changed is the timeout of the test cases. Can we please use 30s by default?

          I also had the same observation about the exception that Chris Nauroth made, and there are a couple of other things I don't understand. In this loop:

          +            while (zkServer.isStateRunning()) {
          +                try {
          +                    Thread.sleep(1000); // watch interval
          +                } catch (InterruptedException ie) {
          +                    LOG.info("Thread interrupted");
          +                }
          +            }
          

          Shouldn't it be while (zk.isRunning()) { instead?

          For the leader and learner, why is it isStateRunning here:

          +    public boolean isRunning() {
          +        return self.isRunning() && zk.isStateRunning();
          +    }
          

          and not this:

          +    public boolean isRunning() {
          +        return self.isRunning() && zk.isRunning();
          +    }
          

          The rationale is that we are running if both the peer is running and the server is running, so just checking if the state is running isn't sufficient.

          Show
          fpj Flavio Junqueira added a comment - Rakesh R One simple thing I'd like to have changed is the timeout of the test cases. Can we please use 30s by default? I also had the same observation about the exception that Chris Nauroth made, and there are a couple of other things I don't understand. In this loop: + while (zkServer.isStateRunning()) { + try { + Thread.sleep(1000); // watch interval + } catch (InterruptedException ie) { + LOG.info("Thread interrupted"); + } + } Shouldn't it be while (zk.isRunning()) { instead? For the leader and learner, why is it isStateRunning here: + public boolean isRunning() { + return self.isRunning() && zk.isStateRunning(); + } and not this: + public boolean isRunning() { + return self.isRunning() && zk.isRunning(); + } The rationale is that we are running if both the peer is running and the server is running, so just checking if the state is running isn't sufficient.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio Junqueira for the comments.

          Can we please use 30s by default?

          Agreed.

          Shouldn't it be while (zk.isRunning()) instead?

              public boolean isRunning() {
                  return state == State.RUNNING || state == State.ERROR;
              }
          
              public boolean isStateRunning() {
                  return state == State.RUNNING;
              }
          

          State transitions:
          1. At the beginning the state will be INITIAL.
          2. After the successful start, update the server state to RUNNING
          3. When there is an internal error, update the server state to ERROR.
          4. On shutdown, update the server state to SHUTDOWN

          Standlone server watch logic:
          The newly added watch logic will periodically checks RUNNING state and come out of the loop if it sees a state other than RUNNING. With zks.isRunning() method, it will return true if server is RUNNING or ERROR state. So if I use isRunning(), it will never come out of the loop on error situations, right?

          For the leader and learner, why is it isStateRunning here:

          Here also the same case. It should come out of the readPacket function if the server is not RUNNING. With zks.isRunning(), it will never identify the ERROR state and continue reading packet, right?

          isRunning method is reflecting dual state, running as well as running with an error, I think that causes the confusion. I failed to find a better name for this function.

          Show
          rakeshr Rakesh R added a comment - Thanks Flavio Junqueira for the comments. Can we please use 30s by default? Agreed. Shouldn't it be while (zk.isRunning()) instead? public boolean isRunning() { return state == State.RUNNING || state == State.ERROR; } public boolean isStateRunning() { return state == State.RUNNING; } State transitions: 1. At the beginning the state will be INITIAL . 2. After the successful start, update the server state to RUNNING 3. When there is an internal error, update the server state to ERROR . 4. On shutdown, update the server state to SHUTDOWN Standlone server watch logic: The newly added watch logic will periodically checks RUNNING state and come out of the loop if it sees a state other than RUNNING . With zks.isRunning() method, it will return true if server is RUNNING or ERROR state. So if I use isRunning() , it will never come out of the loop on error situations, right? For the leader and learner, why is it isStateRunning here: Here also the same case. It should come out of the readPacket function if the server is not RUNNING. With zks.isRunning() , it will never identify the ERROR state and continue reading packet, right? isRunning method is reflecting dual state, running as well as running with an error , I think that causes the confusion. I failed to find a better name for this function.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Chris Nauroth, good catch. I will include this in my next patch.

          Show
          rakeshr Rakesh R added a comment - Thanks Chris Nauroth , good catch. I will include this in my next patch.
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R Thanks for the clarification, but I'm still finding the predicates a bit confusing, please bear with me. isRunning() should return true if the server is running and the main loop should keep going as long as the call to isRunning() returns true. If there is an error in one of the processors, then the server isn't really running and we want the main loop to exit if the server isn't running.

          I proposed isStateRunning before because in the shutdown methods you pointed out above for learner, observer, and RO we need to know if the server needs shutdown or not. However, it sounds like it would be better to have a call like needsShutdown() instead of isStateRunning, which looks like return state == State.RUNNING || state == State.ERROR. The method isRunning() should go back to state == State.RUNNING.

          Let me know if this makes sense.

          Show
          fpj Flavio Junqueira added a comment - Rakesh R Thanks for the clarification, but I'm still finding the predicates a bit confusing, please bear with me. isRunning() should return true if the server is running and the main loop should keep going as long as the call to isRunning() returns true. If there is an error in one of the processors, then the server isn't really running and we want the main loop to exit if the server isn't running. I proposed isStateRunning before because in the shutdown methods you pointed out above for learner, observer, and RO we need to know if the server needs shutdown or not. However, it sounds like it would be better to have a call like needsShutdown() instead of isStateRunning , which looks like return state == State.RUNNING || state == State.ERROR . The method isRunning() should go back to state == State.RUNNING . Let me know if this makes sense.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio Junqueira for the detailed explanation. This looks good, clear semantics. I'll soon prepare a patch with this.

          Show
          rakeshr Rakesh R added a comment - Thanks Flavio Junqueira for the detailed explanation. This looks good, clear semantics. I'll soon prepare a patch with this.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12786193/ZOOKEEPER-2247-13.patch
          against trunk revision 1726354.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3027//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3027//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3027//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12786193/ZOOKEEPER-2247-13.patch against trunk revision 1726354. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3027//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3027//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3027//console This message is automatically generated.
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R The patch is looking much better now. There are a couple of small points that I think we can still fix:

          1. Could you review the methods you're adding and remove the public modifier if it isn't necessary? For example, a method to set the state shouldn't really be public, it should be at least package protected if not protected/private. I know we aren't super consistent about the modifiers in our code base, but we should try to improve it when possible.
          2. Did you mean to have an annotation here // VisibleForTesting? Perhaps you should just have a comment that this method exists for testing.
          Show
          fpj Flavio Junqueira added a comment - Rakesh R The patch is looking much better now. There are a couple of small points that I think we can still fix: Could you review the methods you're adding and remove the public modifier if it isn't necessary? For example, a method to set the state shouldn't really be public, it should be at least package protected if not protected/private. I know we aren't super consistent about the modifiers in our code base, but we should try to improve it when possible. Did you mean to have an annotation here // VisibleForTesting ? Perhaps you should just have a comment that this method exists for testing.
          Hide
          rakeshr Rakesh R added a comment -

          1 >> Agreed, I'll revisit the access specifiers and correct it.
          2 >> // VisibleForTesting I could see similar comments exists in our code. I referred these and thought of using similar pattern ContainerManager.java#L134, PurgeTxnLog.java#L78, ZooKeeper.java#L1011. Can we follow this pattern now and later if requires can replace together ?

          Show
          rakeshr Rakesh R added a comment - 1 >> Agreed, I'll revisit the access specifiers and correct it. 2 >> // VisibleForTesting I could see similar comments exists in our code. I referred these and thought of using similar pattern ContainerManager.java#L134 , PurgeTxnLog.java#L78 , ZooKeeper.java#L1011 . Can we follow this pattern now and later if requires can replace together ?
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R Sure, we can revisit it later, perhaps create a jira so that we don't forget? Keep in mind that we will need patches for all three branches, please.

          Chris Nauroth are you +1 on this patch? It'd be good to give it a last look before we check this in.

          Show
          fpj Flavio Junqueira added a comment - Rakesh R Sure, we can revisit it later, perhaps create a jira so that we don't forget? Keep in mind that we will need patches for all three branches, please. Chris Nauroth are you +1 on this patch? It'd be good to give it a last look before we check this in.
          Hide
          rakeshr Rakesh R added a comment -
          ZooKeeperServerMain.java
          +            // connection factory will take care of shutting down rest
          +            // of the services
          +            if (cnxnFactory != null) {
          +                cnxnFactory.shutdown();
          +            }
          +            if (secureCnxnFactory != null) {
          +                secureCnxnFactory.shutdown();
          +            }
          

          I've just noticed one more thing. The above graceful shutdown logic in my patch is leaving containerManager and adminServer references. Instead of shutdown each entity separately, how about just make a ZooKeeperServerMain#shutdown() call like below?

          +            while (zkServer.isRunning()) {
          +               try {
          +                    Thread.sleep(1000); // watch interval
          +               } catch (InterruptedException ie) {
          +                    Thread.currentThread().interrupt();
          +                   LOG.info("Thread interrupted");
          +               }
          +           }
          +
          +          shutdown();
          
                      if (cnxnFactory != null) {
                          cnxnFactory.join();
                      }
                     // ......
                     // .....
          
          Show
          rakeshr Rakesh R added a comment - ZooKeeperServerMain.java + // connection factory will take care of shutting down rest + // of the services + if (cnxnFactory != null ) { + cnxnFactory.shutdown(); + } + if (secureCnxnFactory != null ) { + secureCnxnFactory.shutdown(); + } I've just noticed one more thing. The above graceful shutdown logic in my patch is leaving containerManager and adminServer references. Instead of shutdown each entity separately, how about just make a ZooKeeperServerMain#shutdown() call like below? + while (zkServer.isRunning()) { + try { + Thread .sleep(1000); // watch interval + } catch (InterruptedException ie) { + Thread .currentThread().interrupt(); + LOG.info( " Thread interrupted" ); + } + } + + shutdown(); if (cnxnFactory != null ) { cnxnFactory.join(); } // ...... // .....
          Hide
          rakeshr Rakesh R added a comment -

          Thanks for the advice. I just created ZOOKEEPER-2361 to track this. Sure, once I get +1s, will upload patch for respective branches.

          Show
          rakeshr Rakesh R added a comment - Thanks for the advice. I just created ZOOKEEPER-2361 to track this. Sure, once I get +1s, will upload patch for respective branches.
          Hide
          fpj Flavio Junqueira added a comment -

          By leaving containerManager and adminServer references, do you mean not stopping and shutting down the respective instances? I think you're proposing to replace this:

          ZooKeeperServerMain.java
          +            // connection factory will take care of shutting down rest
          +            // of the services
          +            if (cnxnFactory != null) {
          +                cnxnFactory.shutdown();
          +            }
          +            if (secureCnxnFactory != null) {
          +                secureCnxnFactory.shutdown();
          +            }
          

          with a call to the existing method ZooKeeperServerMain.shutdown(), is that right? I haven't checked if calling adminServer.shutdown() can produce any issue like an NPE here, but it doesn't look like.

          Show
          fpj Flavio Junqueira added a comment - By leaving containerManager and adminServer references, do you mean not stopping and shutting down the respective instances? I think you're proposing to replace this: ZooKeeperServerMain.java + // connection factory will take care of shutting down rest + // of the services + if (cnxnFactory != null) { + cnxnFactory.shutdown(); + } + if (secureCnxnFactory != null) { + secureCnxnFactory.shutdown(); + } with a call to the existing method ZooKeeperServerMain.shutdown() , is that right? I haven't checked if calling adminServer.shutdown() can produce any issue like an NPE here, but it doesn't look like.
          Hide
          rakeshr Rakesh R added a comment -

          By leaving containerManager and adminServer references, do you mean not stopping and shutting down the respective instances?

          exactly, stopping containerManager and adminServer.

          with a call to the existing method ZooKeeperServerMain.shutdown(), is that right? I haven't checked if calling adminServer.shutdown() can produce any issue like an NPE here, but it doesn't look like.

          Yes, in addition to the above ZooKeeperServerMain.shutdown() replacement, am thinking to add a new running flag in ZooKeeperServerMain class to avoid multiple shutdown calls. While shutdown, it will first check whether the main#shutdown() is already invoked. Does this makes sense?

          Also, I feel it is safe to add extra null check adminServer != null. I will include this also in my next patch.

          Show
          rakeshr Rakesh R added a comment - By leaving containerManager and adminServer references, do you mean not stopping and shutting down the respective instances? exactly, stopping containerManager and adminServer . with a call to the existing method ZooKeeperServerMain.shutdown(), is that right? I haven't checked if calling adminServer.shutdown() can produce any issue like an NPE here, but it doesn't look like. Yes, in addition to the above ZooKeeperServerMain.shutdown() replacement, am thinking to add a new running flag in ZooKeeperServerMain class to avoid multiple shutdown calls. While shutdown, it will first check whether the main#shutdown() is already invoked. Does this makes sense? Also, I feel it is safe to add extra null check adminServer != null . I will include this also in my next patch.
          Hide
          fpj Flavio Junqueira added a comment -

          am thinking to add a new running flag in ZooKeeperServerMain class to avoid multiple shutdown calls.

          Avoiding multiple shutdown calls sounds good, even though I'd expect the shutdown call to be idempotent. I'd rather avoid having yet another flag if possible, though.

          Show
          fpj Flavio Junqueira added a comment - am thinking to add a new running flag in ZooKeeperServerMain class to avoid multiple shutdown calls. Avoiding multiple shutdown calls sounds good, even though I'd expect the shutdown call to be idempotent. I'd rather avoid having yet another flag if possible, though.
          Hide
          rakeshr Rakesh R added a comment -

          Attached another patch with above mentioned ZooKeeperServerMain.shutdown() changes. Please review it again. Thanks!

          Show
          rakeshr Rakesh R added a comment - Attached another patch with above mentioned ZooKeeperServerMain.shutdown() changes. Please review it again. Thanks!
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12786408/ZOOKEEPER-2247-14.patch
          against trunk revision 1728577.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3031//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3031//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3031//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12786408/ZOOKEEPER-2247-14.patch against trunk revision 1728577. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. -1 findbugs. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3031//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3031//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3031//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12786418/ZOOKEEPER-2247-15.patch
          against trunk revision 1728577.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3032//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3032//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3032//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12786418/ZOOKEEPER-2247-15.patch against trunk revision 1728577. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3032//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3032//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3032//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12786428/ZOOKEEPER-2247-br-3.4.patch
          against trunk revision 1728577.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3033//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12786428/ZOOKEEPER-2247-br-3.4.patch against trunk revision 1728577. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3033//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Flavio Junqueira, Chris Nauroth could you review it again once you get a chance, thanks!

          Show
          rakeshr Rakesh R added a comment - Flavio Junqueira , Chris Nauroth could you review it again once you get a chance, thanks!
          Hide
          cnauroth Chris Nauroth added a comment -

          Hello Rakesh R. This looks good, and I am +1 for v15 of the trunk patch.

          I have one question on the branch-3.4 patch. I see this is removing the System#exit call from SyncRequestProcessor. For trunk, that call was already removed as part of ZOOKEEPER-602. Was there some specific reason that this wasn't removed during the ZOOKEEPER-602 backport to branch-3.4, or do you think it was just an oversight?

          Show
          cnauroth Chris Nauroth added a comment - Hello Rakesh R . This looks good, and I am +1 for v15 of the trunk patch. I have one question on the branch-3.4 patch. I see this is removing the System#exit call from SyncRequestProcessor . For trunk, that call was already removed as part of ZOOKEEPER-602 . Was there some specific reason that this wasn't removed during the ZOOKEEPER-602 backport to branch-3.4, or do you think it was just an oversight?
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Chris Nauroth for the reviews.

          I have one question on the branch-3.4 patch. I see this is removing the System#exit call from SyncRequestProcessor

          This was actually missed during the ZOOKEEPER-602 backport to branch-3.4. Now, I've got the chance to cleanup System#exit call. As we know with this patch it is giving another chance to do the re-election after hitting unrecoverable errors. So this call is not required.

          Show
          rakeshr Rakesh R added a comment - Thanks Chris Nauroth for the reviews. I have one question on the branch-3.4 patch. I see this is removing the System#exit call from SyncRequestProcessor This was actually missed during the ZOOKEEPER-602 backport to branch-3.4 . Now, I've got the chance to cleanup System#exit call. As we know with this patch it is giving another chance to do the re-election after hitting unrecoverable errors. So this call is not required.
          Hide
          cnauroth Chris Nauroth added a comment -

          As we know with this patch it is giving another chance to do the re-election after hitting unrecoverable errors. So this call is not required.

          Great, this all lines up with my understanding. I just wanted to make sure there wasn't some edge case related to the System#exit call specific to branch-3.4.

          With that, I am now +1 for both the trunk/branch-3.5 patch and the branch-3.4 patch. Flavio Junqueira, would you like to take another look?

          Show
          cnauroth Chris Nauroth added a comment - As we know with this patch it is giving another chance to do the re-election after hitting unrecoverable errors. So this call is not required. Great, this all lines up with my understanding. I just wanted to make sure there wasn't some edge case related to the System#exit call specific to branch-3.4. With that, I am now +1 for both the trunk/branch-3.5 patch and the branch-3.4 patch. Flavio Junqueira , would you like to take another look?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user arshadmohammad closed the pull request at:

          https://github.com/apache/zookeeper/pull/52

          Show
          githubbot ASF GitHub Bot added a comment - Github user arshadmohammad closed the pull request at: https://github.com/apache/zookeeper/pull/52
          Hide
          rakeshr Rakesh R added a comment -

          Flavio Junqueira, would be great to see your feedback. Please take a look at the patch when you get some time. Thanks!

          Show
          rakeshr Rakesh R added a comment - Flavio Junqueira , would be great to see your feedback. Please take a look at the patch when you get some time. Thanks!
          Hide
          fpj Flavio Junqueira added a comment -

          I'll have a look, but why was the pull request closed? I like reviewing on github and also I had some comments there. Were those comments addressed?

          Show
          fpj Flavio Junqueira added a comment - I'll have a look, but why was the pull request closed? I like reviewing on github and also I had some comments there. Were those comments addressed?
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user rakeshadr opened a pull request:

          https://github.com/apache/zookeeper/pull/65

          ZOOKEEPER-2247

          Created this PR using proposed "ZOOKEEPER-2247-15.patch" in the jira.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/rakeshadr/zookeeper-1 ZOOKEEPER-2247

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/zookeeper/pull/65.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #65


          commit 27b66905a69e57cd21f225f998edea4e1812825d
          Author: Rakesh R <rakeshr@apache.org>
          Date: 2016-04-19T14:27:16Z

          ZOOKEEPER-2247


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user rakeshadr opened a pull request: https://github.com/apache/zookeeper/pull/65 ZOOKEEPER-2247 Created this PR using proposed " ZOOKEEPER-2247 -15.patch" in the jira. You can merge this pull request into a Git repository by running: $ git pull https://github.com/rakeshadr/zookeeper-1 ZOOKEEPER-2247 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/zookeeper/pull/65.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #65 commit 27b66905a69e57cd21f225f998edea4e1812825d Author: Rakesh R <rakeshr@apache.org> Date: 2016-04-19T14:27:16Z ZOOKEEPER-2247
          Hide
          rakeshr Rakesh R added a comment -

          PR zookeeper/pull/52 was initially created by Arshad. Later the solution is changed and I hope thats the reason he closed the PR. I've created another PR with the latest patch ZOOKEEPER-2247-15.patch for reviews. I think I've addressed your comments which are applicable to the new solution. Following are the review comments I've referred from previous PR zookeeper/pull/52
          1) Can we make this timeout shorter and use the same 30000 we use with others? => DONE
          2) minor: "... so not proceeding to shutdown!" -> "... not proceeding with shutdown." => there is no code changes in new solution.
          3) minor: can we make this uniform and have "x != null"? => DONE
          4) The parent class also has a private member that is a listener. Do we really need two listeners for a QuorumZooKeeperServer instance? => DONE

          Show
          rakeshr Rakesh R added a comment - PR zookeeper/pull/52 was initially created by Arshad. Later the solution is changed and I hope thats the reason he closed the PR. I've created another PR with the latest patch ZOOKEEPER-2247 -15.patch for reviews. I think I've addressed your comments which are applicable to the new solution. Following are the review comments I've referred from previous PR zookeeper/pull/52 1) Can we make this timeout shorter and use the same 30000 we use with others? => DONE 2) minor: "... so not proceeding to shutdown!" -> "... not proceeding with shutdown." => there is no code changes in new solution. 3) minor: can we make this uniform and have "x != null"? => DONE 4) The parent class also has a private member that is a listener. Do we really need two listeners for a QuorumZooKeeperServer instance? => DONE
          Hide
          fpj Flavio Junqueira added a comment -

          Thanks Rakesh R.

          Show
          fpj Flavio Junqueira added a comment - Thanks Rakesh R .
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12812810/ZOOKEEPER-2247-16.patch
          against trunk revision 1748630.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3242//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12812810/ZOOKEEPER-2247-16.patch against trunk revision 1748630. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3242//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Attached ZK-16.patch new patch addressing Flavio Junqueira's comments given in the PR-65. I will prepare patch for 3.4 branch once I get +1 for the latest patch.

          Show
          rakeshr Rakesh R added a comment - Attached ZK-16.patch new patch addressing Flavio Junqueira 's comments given in the PR-65 . I will prepare patch for 3.4 branch once I get +1 for the latest patch.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12812810/ZOOKEEPER-2247-16.patch
          against trunk revision 1748630.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 10 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3244//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12812810/ZOOKEEPER-2247-16.patch against trunk revision 1748630. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 10 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3244//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12812823/ZOOKEEPER-2247-17.patch
          against trunk revision 1748630.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3245//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3245//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3245//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12812823/ZOOKEEPER-2247-17.patch against trunk revision 1748630. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3245//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3245//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3245//console This message is automatically generated.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12812843/ZOOKEEPER-2247-18.patch
          against trunk revision 1748630.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3247//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3247//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3247//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12812843/ZOOKEEPER-2247-18.patch against trunk revision 1748630. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3247//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3247//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3247//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Note: It seems the test case failure is not related to my patch, please ignore it.

          Show
          rakeshr Rakesh R added a comment - Note: It seems the test case failure is not related to my patch, please ignore it.
          Hide
          fpj Flavio Junqueira added a comment -

          +1, LGTM. Thanks, Rakesh R.

          Show
          fpj Flavio Junqueira added a comment - +1, LGTM. Thanks, Rakesh R .
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Flavio Junqueira, Rakesh R: sorry for the late comments. So, for the test case in ZooKeeperServerMainTest.java:

               /**
          +     * Test case for https://issues.apache.org/jira/browse/ZOOKEEPER-2247.
          +     * Test to verify that even after non recoverable error (error while
          +     * writing transaction log) on ZooKeeper service will be available
          +     */
          +    @Test(timeout = 30000)
          +    public void testNonRecoverableError() throws Exception {
          

          That's really not what's happening, given that we don't wait for the quorum to come back. We only wait for the injected failure to happen. Does this test case actually provide anything new to what we have for in NonRecoverableErrorTest.java? Am I missing some context?

          Show
          rgs Raul Gutierrez Segales added a comment - Flavio Junqueira , Rakesh R : sorry for the late comments. So, for the test case in ZooKeeperServerMainTest.java: /** + * Test case for https: //issues.apache.org/jira/browse/ZOOKEEPER-2247. + * Test to verify that even after non recoverable error (error while + * writing transaction log) on ZooKeeper service will be available + */ + @Test(timeout = 30000) + public void testNonRecoverableError() throws Exception { That's really not what's happening, given that we don't wait for the quorum to come back. We only wait for the injected failure to happen. Does this test case actually provide anything new to what we have for in NonRecoverableErrorTest.java? Am I missing some context?
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Another thing, this feels weird, in ZooKeeperServerStateListener.java:

          +class ZooKeeperServerStateListener {
          +    private final CountDownLatch shutdownLatch;
          +
          +    ZooKeeperServerStateListener(CountDownLatch shutdownLatch) {
          +        this.shutdownLatch = shutdownLatch;
          +    }
          +
          +    /**
          +     * This will be invoked when the server transition to a new server state.
          +     *
          +     * @param state new server state
          +     */
          +    void stateChanged(State state) {
          +        if (state != State.RUNNING) {
          +            shutdownLatch.countDown();
          +        }
          +    }
          +}
          

          I think the name is misleading, since it's only used to watch/count shutdown events. Should we name it appropriately then?

          Also, what happens if someone calls stateChanged(State.INITIAL), we'd still call shutdownLatch.countDown(). Should we not assert that doesn't happen?

          Show
          rgs Raul Gutierrez Segales added a comment - Another thing, this feels weird, in ZooKeeperServerStateListener.java: +class ZooKeeperServerStateListener { + private final CountDownLatch shutdownLatch; + + ZooKeeperServerStateListener(CountDownLatch shutdownLatch) { + this .shutdownLatch = shutdownLatch; + } + + /** + * This will be invoked when the server transition to a new server state. + * + * @param state new server state + */ + void stateChanged(State state) { + if (state != State.RUNNING) { + shutdownLatch.countDown(); + } + } +} I think the name is misleading, since it's only used to watch/count shutdown events. Should we name it appropriately then? Also, what happens if someone calls stateChanged(State.INITIAL), we'd still call shutdownLatch.countDown(). Should we not assert that doesn't happen?
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Minor nit:

          +    /**
          +     * This can be used while shutting down the server to see whether the server
          +     * is already shutdown or not.
          +     *
          +     * @return true if the server is running or server hits an error, false
          +     *         otherwise.
          +     */
          +    protected boolean needsShutdown() {
          +        return state == State.RUNNING || state == State.ERROR;
          +    }
          

          should probably be canShutdown(), given that if you are in State.RUNNING it's not like you need a shutdown.

          Show
          rgs Raul Gutierrez Segales added a comment - Minor nit: + /** + * This can be used while shutting down the server to see whether the server + * is already shutdown or not. + * + * @ return true if the server is running or server hits an error, false + * otherwise. + */ + protected boolean needsShutdown() { + return state == State.RUNNING || state == State.ERROR; + } should probably be canShutdown(), given that if you are in State.RUNNING it's not like you need a shutdown.
          Hide
          fpj Flavio Junqueira added a comment -

          Let's not rush into getting this one in, it would be good to fix it, but we can leave to 3.5.3.

          Show
          fpj Flavio Junqueira added a comment - Let's not rush into getting this one in, it would be good to fix it, but we can leave to 3.5.3.
          Hide
          rakeshr Rakesh R added a comment -

          Thank you Raul Gutierrez Segales for taking a look at this.

          NonRecoverableErrorTest extends QuorumPeerTestBase => This is testing the behavior in zookeeper quorum servers.

          ZooKeeperServerMainTest#testNonRecoverableError => This test class is for testing behavior in standalone server, thats the reason it is just injecting failure to the single standalone server and continue to next step.

          /**
           * Test stand-alone server.
           *
           */
          public class ZooKeeperServerMainTest extends ZKTestCase implements Watcher {
          

          Is anything required to be done?

          Show
          rakeshr Rakesh R added a comment - Thank you Raul Gutierrez Segales for taking a look at this. NonRecoverableErrorTest extends QuorumPeerTestBase => This is testing the behavior in zookeeper quorum servers. ZooKeeperServerMainTest#testNonRecoverableError => This test class is for testing behavior in standalone server, thats the reason it is just injecting failure to the single standalone server and continue to next step. /** * Test stand-alone server. * */ public class ZooKeeperServerMainTest extends ZKTestCase implements Watcher { Is anything required to be done?
          Hide
          rakeshr Rakesh R added a comment -

          I think the name is misleading, since it's only used to watch/count shutdown events. Should we name it appropriately then?

          Since this receives all the state change notifications I named it as StateListener.

          How about renaming to ZooKeeperServerShutdownListener or ZooKeeperServerShutdownHandler ?

          Also, what happens if someone calls stateChanged(State.INITIAL), we'd still call shutdownLatch.countDown(). Should we not assert that doesn't happen?

          Good point. I will change the condition to if ((state == State.ERROR) || (state == State.SHUTDOWN)). OK?

          Show
          rakeshr Rakesh R added a comment - I think the name is misleading, since it's only used to watch/count shutdown events. Should we name it appropriately then? Since this receives all the state change notifications I named it as StateListener . How about renaming to ZooKeeperServerShutdownListener or ZooKeeperServerShutdownHandler ? Also, what happens if someone calls stateChanged(State.INITIAL), we'd still call shutdownLatch.countDown(). Should we not assert that doesn't happen? Good point. I will change the condition to if ((state == State.ERROR) || (state == State.SHUTDOWN)) . OK?
          Hide
          rakeshr Rakesh R added a comment -

          OK, will use canShutdown

          Show
          rakeshr Rakesh R added a comment - OK, will use canShutdown
          Hide
          arshad.mohammad Mohammad Arshad added a comment -

          Rakesh R, Latest patch does not apply, Can you please rebase it.

          Show
          arshad.mohammad Mohammad Arshad added a comment - Rakesh R , Latest patch does not apply, Can you please rebase it.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Mohammad Arshad for the interest. I'm planning to prepare new patch once I get a reply/inputs from Raul Gutierrez Segales for the above set of review comments.

          Show
          rakeshr Rakesh R added a comment - Thanks Mohammad Arshad for the interest. I'm planning to prepare new patch once I get a reply/inputs from Raul Gutierrez Segales for the above set of review comments.
          Hide
          rakeshr Rakesh R added a comment -

          Hi Raul Gutierrez Segales, Few days back I've replied to your comments. It would be great to see your feedback and I will prepare final patch based on that. Thanks!

          Show
          rakeshr Rakesh R added a comment - Hi Raul Gutierrez Segales , Few days back I've replied to your comments. It would be great to see your feedback and I will prepare final patch based on that. Thanks!
          Hide
          rakeshr Rakesh R added a comment -

          Attached new patch addressing Raul Gutierrez Segales comments. Please review it again.

          Show
          rakeshr Rakesh R added a comment - Attached new patch addressing Raul Gutierrez Segales comments. Please review it again.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12818755/ZOOKEEPER-2247-19.patch
          against trunk revision 1750739.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3278//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3278//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3278//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12818755/ZOOKEEPER-2247-19.patch against trunk revision 1750739. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3278//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3278//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3278//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          It looks like LETest.testLE failure is not related to the patch, please ignore it.

          Show
          rakeshr Rakesh R added a comment - It looks like LETest.testLE failure is not related to the patch, please ignore it.
          Hide
          rakeshr Rakesh R added a comment -

          ping Flavio Junqueira, Raul Gutierrez Segales, would be great to see your feedback on pushing this in. I've tried fixing all your comments in the latest patch. Kindly review it again. Thanks!

          Show
          rakeshr Rakesh R added a comment - ping Flavio Junqueira , Raul Gutierrez Segales , would be great to see your feedback on pushing this in. I've tried fixing all your comments in the latest patch. Kindly review it again. Thanks!
          Hide
          hanm Michael Han added a comment -

          This is a known flaky tests and is tracked by ZOOKEEPER-2483.

          Show
          hanm Michael Han added a comment - This is a known flaky tests and is tracked by ZOOKEEPER-2483 .
          Hide
          fpj Flavio Junqueira added a comment -
          Show
          fpj Flavio Junqueira added a comment - Is this PR updated, Rakesh R : https://github.com/apache/zookeeper/pull/65
          Hide
          rakeshr Rakesh R added a comment -

          I've updated the PR with the changes.

          Show
          rakeshr Rakesh R added a comment - I've updated the PR with the changes.
          Hide
          rgs Raul Gutierrez Segales added a comment -

          Rakesh R: sorry for dropping the ball here. Lgtm, +1. One nit though:

          +        if ((state == State.ERROR) || (state == State.SHUTDOWN)) {
          

          Drop the extra ()s around the state checks, it's readable enough without them.

          I can merge this once we have a +1 from Flavio as well. Thanks!

          Show
          rgs Raul Gutierrez Segales added a comment - Rakesh R : sorry for dropping the ball here. Lgtm, +1. One nit though: + if ((state == State.ERROR) || (state == State.SHUTDOWN)) { Drop the extra ()s around the state checks, it's readable enough without them. I can merge this once we have a +1 from Flavio as well. Thanks!
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Raul Gutierrez Segales for the reviews, attached new patch addressing the same. It seems separate patch for branch-3.4 is needed, I'll upload once it is ready for commit.

          Show
          rakeshr Rakesh R added a comment - Thanks Raul Gutierrez Segales for the reviews, attached new patch addressing the same. It seems separate patch for branch-3.4 is needed, I'll upload once it is ready for commit.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12821512/ZOOKEEPER-2247-20.patch
          against trunk revision 1754582.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3319//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3319//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3319//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12821512/ZOOKEEPER-2247-20.patch against trunk revision 1754582. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3319//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3319//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3319//console This message is automatically generated.
          Hide
          fpj Flavio Junqueira added a comment -

          Rakesh R hope you don't mind, I left just a few minor questions on github. It looks good, though.

          Show
          fpj Flavio Junqueira added a comment - Rakesh R hope you don't mind, I left just a few minor questions on github. It looks good, though.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio Junqueira for the reviews. I've replied to those comments.

          Show
          rakeshr Rakesh R added a comment - Thanks Flavio Junqueira for the reviews. I've replied to those comments.
          Hide
          rakeshr Rakesh R added a comment -

          Attached new patch addressing Flavio Junqueira's comments given in Github_PR_65. Thanks!

          Show
          rakeshr Rakesh R added a comment - Attached new patch addressing Flavio Junqueira 's comments given in Github_PR_65 . Thanks!
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12822331/ZOOKEEPER-2247-21.patch
          against trunk revision 1755100.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3326//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3326//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3326//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12822331/ZOOKEEPER-2247-21.patch against trunk revision 1755100. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3326//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3326//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3326//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Flavio Junqueira, I'll prepare branch-3-4 patch once I get +1 for the latest trunk patch. Kindly take a look at the attached patch, Thanks!

          Show
          rakeshr Rakesh R added a comment - Flavio Junqueira , I'll prepare branch-3-4 patch once I get +1 for the latest trunk patch. Kindly take a look at the attached patch, Thanks!
          Hide
          fpj Flavio Junqueira added a comment -

          The patch looks good, it is pretty much ready to go. Since you'll have to generate at least the 3.4 patch, I'll ask you to fix a couple of small things, hope you don't mind:

          I have extended the javadoc of `setState`, would you mind using this one instead:

          /**
          +     * Sets the state of ZooKeeper server. After changing the state, it
          +     * notifies the server state change to a registered shutdown handler,
          +     * if any.
          +     * <p>
          +     * The following are the server state transitions:
          +     * <li>During startup the server will be in the INITIAL state.</li>
          +     * <li>After successfully starting, the server sets the state to
          +     * RUNNING.</li>
          +     * <li> The server transitions to the ERROR state if it hits an internal error.
          +     * {@link ZooKeeperServerListenerImpl} notifies any critical resource error
          +     * events, e.g., SyncRequestProcessor not being able to write a txn to disk.</li>
          +     * <li>During shutdown the server sets the state to SHUTDOWN, which
          +     * corresponds to the server not running.</li>
          +     *
          +     * @param state new server state.
          +     */
          

          The same for the javadoc of ZooKeeperServerShutdownHandler:

          +/**
          + * ZooKeeper server shutdown handler which will be used to handle ERROR or
          + * SHUTDOWN server state transitions, which in turn releases the associated shutdown
          + * latch.
          + */
          

          Finally, in waitForNewLeaderElection, would you mind setting the Thread.sleep to sleep for only 100ms each time? Not sure if we need to increase the counter, but I'd rather reduce the duration of each iteration.

          Otherwise, it looks very good, thanks for bearing with all the comments and working hard to get it in good shape. If you make these changes and generate the branch patches, I'll check this one in.

          Show
          fpj Flavio Junqueira added a comment - The patch looks good, it is pretty much ready to go. Since you'll have to generate at least the 3.4 patch, I'll ask you to fix a couple of small things, hope you don't mind: I have extended the javadoc of `setState`, would you mind using this one instead: /** + * Sets the state of ZooKeeper server. After changing the state, it + * notifies the server state change to a registered shutdown handler, + * if any. + * <p> + * The following are the server state transitions: + * <li>During startup the server will be in the INITIAL state.</li> + * <li>After successfully starting, the server sets the state to + * RUNNING.</li> + * <li> The server transitions to the ERROR state if it hits an internal error. + * {@link ZooKeeperServerListenerImpl} notifies any critical resource error + * events, e.g., SyncRequestProcessor not being able to write a txn to disk.</li> + * <li>During shutdown the server sets the state to SHUTDOWN, which + * corresponds to the server not running.</li> + * + * @param state new server state. + */ The same for the javadoc of ZooKeeperServerShutdownHandler : +/** + * ZooKeeper server shutdown handler which will be used to handle ERROR or + * SHUTDOWN server state transitions, which in turn releases the associated shutdown + * latch. + */ Finally, in waitForNewLeaderElection , would you mind setting the Thread.sleep to sleep for only 100ms each time? Not sure if we need to increase the counter, but I'd rather reduce the duration of each iteration. Otherwise, it looks very good, thanks for bearing with all the comments and working hard to get it in good shape. If you make these changes and generate the branch patches, I'll check this one in.
          Hide
          fpj Flavio Junqueira added a comment -

          +1, pending jenkins run.

          Show
          fpj Flavio Junqueira added a comment - +1, pending jenkins run.
          Hide
          hadoopqa Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12822541/ZOOKEEPER-2247-22.patch
          against trunk revision 1755379.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3331//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3331//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3331//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12822541/ZOOKEEPER-2247-22.patch against trunk revision 1755379. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3331//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3331//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3331//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks Flavio Junqueira for the detailed reviews and comments. I'll upload br-3-4 patch shortly.

          Show
          rakeshr Rakesh R added a comment - Thanks Flavio Junqueira for the detailed reviews and comments. I'll upload br-3-4 patch shortly.
          Hide
          rakeshr Rakesh R added a comment -

          Attached trunk(made one minor javadoc correction in ZooKeeperServerMainTest.java test class) and branch-3-4 patches.

          Show
          rakeshr Rakesh R added a comment - Attached trunk(made one minor javadoc correction in ZooKeeperServerMainTest.java test class) and branch-3-4 patches.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12822559/ZOOKEEPER-2247-br-3.4.patch
          against trunk revision 1755379.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 9 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3332//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12822559/ZOOKEEPER-2247-br-3.4.patch against trunk revision 1755379. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 9 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3332//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Re-attaching the trunk patch again to trigger the Jenkins. Please ignore the previous failure which is due to applying in branch.

          Show
          rakeshr Rakesh R added a comment - Re-attaching the trunk patch again to trigger the Jenkins. Please ignore the previous failure which is due to applying in branch.
          Hide
          hadoopqa Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12822569/ZOOKEEPER-2247-23.patch
          against trunk revision 1755379.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 6 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3333//testReport/
          Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3333//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3333//console

          This message is automatically generated.

          Show
          hadoopqa Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12822569/ZOOKEEPER-2247-23.patch against trunk revision 1755379. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 6 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3333//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3333//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/3333//console This message is automatically generated.
          Hide
          rakeshr Rakesh R added a comment -

          Test case failure is unrelated to the patch, which will be taken care by ZOOKEEPER-2152 jira.

               [exec]      [exec] Zookeeper_readOnly::testReadOnly : elapsed 4153 : OK
               [exec]      [exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/c/tests/TestReconfig.cc:154: Assertion: assertion failed [Expression: false]
               [exec]      [exec] Failures !!!
               [exec]      [exec] Run: 72   Failure total: 1   Failures: 1   Errors: 0
               [exec]      [exec] FAIL: zktest-mt
          
          Show
          rakeshr Rakesh R added a comment - Test case failure is unrelated to the patch, which will be taken care by ZOOKEEPER-2152 jira. [exec] [exec] Zookeeper_readOnly::testReadOnly : elapsed 4153 : OK [exec] [exec] /home/jenkins/jenkins-slave/workspace/PreCommit-ZOOKEEPER-Build/trunk/src/c/tests/TestReconfig.cc:154: Assertion: assertion failed [Expression: false ] [exec] [exec] Failures !!! [exec] [exec] Run: 72 Failure total: 1 Failures: 1 Errors: 0 [exec] [exec] FAIL: zktest-mt
          Hide
          hudson Hudson added a comment -

          SUCCESS: Integrated in ZooKeeper-trunk #3035 (See https://builds.apache.org/job/ZooKeeper-trunk/3035/)
          ZOOKEEPER-2247: Zookeeper service becomes unavailable when leader fails to write transaction log (Rakesh via fpj) (fpj: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1756262)

          • trunk/CHANGES.txt
          • trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java
          • trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerListenerImpl.java
          • trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java
          • trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerShutdownHandler.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerZooKeeperServer.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/Observer.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/ObserverZooKeeperServer.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumZooKeeperServer.java
          • trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyZooKeeperServer.java
          • trunk/src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java
          • trunk/src/java/test/org/apache/zookeeper/test/NonRecoverableErrorTest.java
          Show
          hudson Hudson added a comment - SUCCESS: Integrated in ZooKeeper-trunk #3035 (See https://builds.apache.org/job/ZooKeeper-trunk/3035/ ) ZOOKEEPER-2247 : Zookeeper service becomes unavailable when leader fails to write transaction log (Rakesh via fpj) (fpj: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1756262 ) trunk/CHANGES.txt trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerListenerImpl.java trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerMain.java trunk/src/java/main/org/apache/zookeeper/server/ZooKeeperServerShutdownHandler.java trunk/src/java/main/org/apache/zookeeper/server/quorum/Follower.java trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java trunk/src/java/main/org/apache/zookeeper/server/quorum/Learner.java trunk/src/java/main/org/apache/zookeeper/server/quorum/LearnerZooKeeperServer.java trunk/src/java/main/org/apache/zookeeper/server/quorum/Observer.java trunk/src/java/main/org/apache/zookeeper/server/quorum/ObserverZooKeeperServer.java trunk/src/java/main/org/apache/zookeeper/server/quorum/QuorumZooKeeperServer.java trunk/src/java/main/org/apache/zookeeper/server/quorum/ReadOnlyZooKeeperServer.java trunk/src/java/test/org/apache/zookeeper/server/ZooKeeperServerMainTest.java trunk/src/java/test/org/apache/zookeeper/test/NonRecoverableErrorTest.java
          Hide
          fpj Flavio Junqueira added a comment -

          +1, thanks for the hard work Rakesh R

          Branch 3.4: Committed revision 1756260.
          Branch 3.5: Committed revision 1756270.
          Trunk: Committed revision 1756262.

          Show
          fpj Flavio Junqueira added a comment - +1, thanks for the hard work Rakesh R Branch 3.4: Committed revision 1756260. Branch 3.5: Committed revision 1756270. Trunk: Committed revision 1756262.
          Hide
          rakeshr Rakesh R added a comment -

          Thanks a lot Flavio Junqueira for the continuous support in resolving this issue. Also, thanking Raul Gutierrez Segales, Chris Nauroth, Michael Han, Mohammad Arshad for your reviews and inputs.

          Show
          rakeshr Rakesh R added a comment - Thanks a lot Flavio Junqueira for the continuous support in resolving this issue. Also, thanking Raul Gutierrez Segales , Chris Nauroth , Michael Han , Mohammad Arshad for your reviews and inputs.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user hanm commented on the issue:

          https://github.com/apache/zookeeper/pull/65

          @rakeshadr I think this pr can be closed because it's merged already.

          Show
          githubbot ASF GitHub Bot added a comment - Github user hanm commented on the issue: https://github.com/apache/zookeeper/pull/65 @rakeshadr I think this pr can be closed because it's merged already.

            People

            • Assignee:
              rakeshr Rakesh R
              Reporter:
              arshad.mohammad Mohammad Arshad
            • Votes:
              0 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development