ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1270

testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving.

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.0, 3.5.0
    • Component/s: server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Looks pretty serious - quorum is formed but no clients can attach. Will attach logs momentarily.

      This test was introduced in the following commit (all three jira commit at once):
      ZOOKEEPER-335. zookeeper servers should commit the new leader txn to their logs.
      ZOOKEEPER-1081. modify leader/follower code to correctly deal with new leader
      ZOOKEEPER-1082. modify leader election to correctly take into account current

      1. ZOOKEEPER-1270tests2.patch
        5 kB
        Camille Fournier
      2. ZOOKEEPER-1270tests.patch
        6 kB
        Camille Fournier
      3. ZOOKEEPER-1270-and-1194-branch34.patch
        10 kB
        Alexander Shraer
      4. ZOOKEEPER-1270-and-1194.patch
        11 kB
        Alexander Shraer
      5. ZOOKEEPER-1270-and-1194.patch
        11 kB
        Alexander Shraer
      6. zookeeper-1270-1194-34.patch
        8 kB
        Camille Fournier
      7. ZOOKEEPER-1270-1194.patch
        8 kB
        Camille Fournier
      8. ZOOKEEPER-1270.patch
        2 kB
        Flavio Junqueira
      9. ZOOKEEPER-1270.patch
        7 kB
        Patrick Hunt
      10. ZOOKEEPER-1270_br34.patch
        7 kB
        Patrick Hunt
      11. testEarlyLeaderAbandonment5.txt.gz
        54 kB
        Patrick Hunt
      12. testEarlyLeaderAbandonment4.txt.gz
        47 kB
        Patrick Hunt
      13. testEarlyLeaderAbandonment3.txt.gz
        59 kB
        Patrick Hunt
      14. testEarlyLeaderAbandonment2.txt.gz
        48 kB
        Patrick Hunt
      15. testEarlyLeaderAbandonment.txt.gz
        43 kB
        Patrick Hunt

        Issue Links

          Activity

          Hide
          Patrick Hunt added a comment -

          It looks like this issue might be related to Read only mode - after applying the patch from ZOOKEEPER-1268 I could not get this to happen again. I ran the QuorumPeerMain test over 30 times without a single failure, I run the full test suite 3 times without any issues. It's looking like ROM is somehow causing this failure - which seems likely given the logs and the way ROM operates.

          I'm moving this out of 3.4.0 now that ZOOKEEPER-1268 is applied there and I can no longer reproduce.

          Show
          Patrick Hunt added a comment - It looks like this issue might be related to Read only mode - after applying the patch from ZOOKEEPER-1268 I could not get this to happen again. I ran the QuorumPeerMain test over 30 times without a single failure, I run the full test suite 3 times without any issues. It's looking like ROM is somehow causing this failure - which seems likely given the logs and the way ROM operates. I'm moving this out of 3.4.0 now that ZOOKEEPER-1268 is applied there and I can no longer reproduce.
          Hide
          Patrick Hunt added a comment -

          Looks like this is still hanging around, see time 2011-11-02 22:11:27,124

          https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch34/49/testReport/junit/org.apache.zookeeper.server.quorum/QuorumPeerMainTest/testEarlyLeaderAbandonment/

          2011-11-02 22:11:27,113 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:41413:LearnerHandler@264] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@166340c
          2011-11-02 22:11:27,113 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:41414:LearnerHandler@264] - Follower sid: 0 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@101ac93
          2011-11-02 22:11:27,123 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:41413:LearnerHandler@319] - Synchronizing with Follower sid: 1 maxCommittedLog =100000003 minCommittedLog = 100000001 peerLastZxid = 100000003
          2011-11-02 22:11:27,123 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:41414:LearnerHandler@319] - Synchronizing with Follower sid: 0 maxCommittedLog =100000003 minCommittedLog = 100000001 peerLastZxid = 100000003
          2011-11-02 22:11:27,123 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:41413:LearnerHandler@407] - Sending snapshot last zxid of peer is 0x100000003  zxid of leader is 0x0sent zxid of db as 0x100000003
          2011-11-02 22:11:27,124 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:41414:LearnerHandler@407] - Sending snapshot last zxid of peer is 0x100000003  zxid of leader is 0x0sent zxid of db as 0x100000003
          2011-11-02 22:11:27,123 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230:Learner@323] - Getting a snapshot from leader
          2011-11-02 22:11:27,124 [myid:0] - INFO  [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227:Learner@323] - Getting a snapshot from leader
          2011-11-02 22:11:27,125 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230:FileTxnSnapLog@255] - Snapshotting: 100000003
          2011-11-02 22:11:27,125 [myid:0] - INFO  [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227:FileTxnSnapLog@255] - Snapshotting: 100000003
          2011-11-02 22:11:27,373 [myid:] - INFO  [main-SendThread(localhost:11233):ClientCnxn$SendThread@933] - Opening socket connection to server localhost/127.0.0.1:11233
          2011-11-02 22:11:27,373 [myid:] - INFO  [main-SendThread(localhost:11233):ClientCnxn$SendThread@846] - Socket connection established to localhost/127.0.0.1:11233, initiating session
          2011-11-02 22:11:27,373 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233:NIOServerCnxnFactory@213] - Accepted socket connection from /127.0.0.1:37239
          2011-11-02 22:11:27,374 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
          2011-11-02 22:11:27,374 [myid:2] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233:NIOServerCnxn@1000] - Closed socket connection for client /127.0.0.1:37239 (no session established for client)
          2011-11-02 22:11:27,374 [myid:] - INFO  [main-SendThread(localhost:11233):ClientCnxn$SendThread@1059] - Unable to read additional data from server sessionid 0x233665616f20000, likely server has closed socket, closing socket connection and attempting reconnect
          2011-11-02 22:11:27,714 [myid:] - INFO  [main-SendThread(localhost:11230):ClientCnxn$SendThread@933] - Opening socket connection to server localhost/127.0.0.1:11230
          2011-11-02 22:11:27,714 [myid:] - INFO  [main-SendThread(localhost:11230):ClientCnxn$SendThread@846] - Socket connection established to localhost/127.0.0.1:11230, initiating session
          2011-11-02 22:11:27,714 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230:NIOServerCnxnFactory@213] - Accepted socket connection from /127.0.0.1:33215
          2011-11-02 22:11:27,715 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
          2011-11-02 22:11:27,715 [myid:1] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230:NIOServerCnxn@1000] - Closed socket connection for client /127.0.0.1:33215 (no session established for client)
          2011-11-02 22:11:27,715 [myid:] - INFO  [main-SendThread(localhost:11230):ClientCnxn$SendThread@1059] - Unable to read additional data from server sessionid 0x133665616f50000, likely server has closed socket, closing socket connection and attempting reconnect
          2011-11-02 22:11:27,843 [myid:] - INFO  [main-SendThread(localhost:11227):ClientCnxn$SendThread@933] - Opening socket connection to server localhost/127.0.0.1:11227
          2011-11-02 22:11:27,843 [myid:] - INFO  [main-SendThread(localhost:11227):ClientCnxn$SendThread@846] - Socket connection established to localhost/127.0.0.1:11227, initiating session
          
          Show
          Patrick Hunt added a comment - Looks like this is still hanging around, see time 2011-11-02 22:11:27,124 https://builds.apache.org/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch34/49/testReport/junit/org.apache.zookeeper.server.quorum/QuorumPeerMainTest/testEarlyLeaderAbandonment/ 2011-11-02 22:11:27,113 [myid:2] - INFO [LearnerHandler-/127.0.0.1:41413:LearnerHandler@264] - Follower sid: 1 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@166340c 2011-11-02 22:11:27,113 [myid:2] - INFO [LearnerHandler-/127.0.0.1:41414:LearnerHandler@264] - Follower sid: 0 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@101ac93 2011-11-02 22:11:27,123 [myid:2] - INFO [LearnerHandler-/127.0.0.1:41413:LearnerHandler@319] - Synchronizing with Follower sid: 1 maxCommittedLog =100000003 minCommittedLog = 100000001 peerLastZxid = 100000003 2011-11-02 22:11:27,123 [myid:2] - INFO [LearnerHandler-/127.0.0.1:41414:LearnerHandler@319] - Synchronizing with Follower sid: 0 maxCommittedLog =100000003 minCommittedLog = 100000001 peerLastZxid = 100000003 2011-11-02 22:11:27,123 [myid:2] - INFO [LearnerHandler-/127.0.0.1:41413:LearnerHandler@407] - Sending snapshot last zxid of peer is 0x100000003 zxid of leader is 0x0sent zxid of db as 0x100000003 2011-11-02 22:11:27,124 [myid:2] - INFO [LearnerHandler-/127.0.0.1:41414:LearnerHandler@407] - Sending snapshot last zxid of peer is 0x100000003 zxid of leader is 0x0sent zxid of db as 0x100000003 2011-11-02 22:11:27,123 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230:Learner@323] - Getting a snapshot from leader 2011-11-02 22:11:27,124 [myid:0] - INFO [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227:Learner@323] - Getting a snapshot from leader 2011-11-02 22:11:27,125 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230:FileTxnSnapLog@255] - Snapshotting: 100000003 2011-11-02 22:11:27,125 [myid:0] - INFO [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227:FileTxnSnapLog@255] - Snapshotting: 100000003 2011-11-02 22:11:27,373 [myid:] - INFO [main-SendThread(localhost:11233):ClientCnxn$SendThread@933] - Opening socket connection to server localhost/127.0.0.1:11233 2011-11-02 22:11:27,373 [myid:] - INFO [main-SendThread(localhost:11233):ClientCnxn$SendThread@846] - Socket connection established to localhost/127.0.0.1:11233, initiating session 2011-11-02 22:11:27,373 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233:NIOServerCnxnFactory@213] - Accepted socket connection from /127.0.0.1:37239 2011-11-02 22:11:27,374 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2011-11-02 22:11:27,374 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233:NIOServerCnxn@1000] - Closed socket connection for client /127.0.0.1:37239 (no session established for client) 2011-11-02 22:11:27,374 [myid:] - INFO [main-SendThread(localhost:11233):ClientCnxn$SendThread@1059] - Unable to read additional data from server sessionid 0x233665616f20000, likely server has closed socket, closing socket connection and attempting reconnect 2011-11-02 22:11:27,714 [myid:] - INFO [main-SendThread(localhost:11230):ClientCnxn$SendThread@933] - Opening socket connection to server localhost/127.0.0.1:11230 2011-11-02 22:11:27,714 [myid:] - INFO [main-SendThread(localhost:11230):ClientCnxn$SendThread@846] - Socket connection established to localhost/127.0.0.1:11230, initiating session 2011-11-02 22:11:27,714 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230:NIOServerCnxnFactory@213] - Accepted socket connection from /127.0.0.1:33215 2011-11-02 22:11:27,715 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running 2011-11-02 22:11:27,715 [myid:1] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230:NIOServerCnxn@1000] - Closed socket connection for client /127.0.0.1:33215 (no session established for client) 2011-11-02 22:11:27,715 [myid:] - INFO [main-SendThread(localhost:11230):ClientCnxn$SendThread@1059] - Unable to read additional data from server sessionid 0x133665616f50000, likely server has closed socket, closing socket connection and attempting reconnect 2011-11-02 22:11:27,843 [myid:] - INFO [main-SendThread(localhost:11227):ClientCnxn$SendThread@933] - Opening socket connection to server localhost/127.0.0.1:11227 2011-11-02 22:11:27,843 [myid:] - INFO [main-SendThread(localhost:11227):ClientCnxn$SendThread@846] - Socket connection established to localhost/127.0.0.1:11227, initiating session
          Hide
          Patrick Hunt added a comment -

          testEarlyLeaderAbandonment2.txt.gz contains the second failure logs - from apache jenkins on branch 34

          Show
          Patrick Hunt added a comment - testEarlyLeaderAbandonment2.txt.gz contains the second failure logs - from apache jenkins on branch 34
          Hide
          Mahadev konar added a comment -

          Looks like the zookeeperserver does not start running within the Quorum Peers. There is something really wrong which prevents the Followers/leaders to start running the ZooKeeperServers. I suspect, it has something to do with NEWLeader transaction (could be wrong). Need to look deeper. Another pair of eyes would help!

          Show
          Mahadev konar added a comment - Looks like the zookeeperserver does not start running within the Quorum Peers. There is something really wrong which prevents the Followers/leaders to start running the ZooKeeperServers. I suspect, it has something to do with NEWLeader transaction (could be wrong). Need to look deeper. Another pair of eyes would help!
          Hide
          Flavio Junqueira added a comment -

          For coordination purposes, I'm having a look at this issue.

          Show
          Flavio Junqueira added a comment - For coordination purposes, I'm having a look at this issue.
          Hide
          Flavio Junqueira added a comment -

          Here is my progress so far. Mahadev is right, it sounds like we are not setting the zkServer of ServerCnxFactory, which we do both in Learner and Leader by calling self.cnxnFactory.setZooKeeperServer(zk). We should also be seeing a message like:

          Have quorum of supporters; starting up and setting last processed zxid:
          

          in the log file Pat posted, but we aren't, which implies that the leader establishment procedure is not completing. Is this a timing issue? I'm skeptical about it being a time issue because we wait 10 seconds for the waitForAll call to complete, but I'm not sure if this completely unrealistic or not assuming that the jenkins machine is overloaded.

          Is there anything I'm missing here? I'll keep investigating...

          Show
          Flavio Junqueira added a comment - Here is my progress so far. Mahadev is right, it sounds like we are not setting the zkServer of ServerCnxFactory, which we do both in Learner and Leader by calling self.cnxnFactory.setZooKeeperServer(zk). We should also be seeing a message like: Have quorum of supporters; starting up and setting last processed zxid: in the log file Pat posted, but we aren't, which implies that the leader establishment procedure is not completing. Is this a timing issue? I'm skeptical about it being a time issue because we wait 10 seconds for the waitForAll call to complete, but I'm not sure if this completely unrealistic or not assuming that the jenkins machine is overloaded. Is there anything I'm missing here? I'll keep investigating...
          Hide
          Flavio Junqueira added a comment -

          I forgot to mention that I haven't been able to reproduce it, and I have been running for the past few hours.

          Show
          Flavio Junqueira added a comment - I forgot to mention that I haven't been able to reproduce it, and I have been running for the past few hours.
          Hide
          Camille Fournier added a comment -

          You have come to the same conclusion I did. The leader sends snapshots to the followers and they both log that they are taking a snaphsot (implying that we got the NEWLEADER message), but you don't see anything in the log that indicates the leader actually managed to call processAck on either of the responses, where you should see something like:
          2011-11-03 13:07:49,717 [myid:2] - WARN [LearnerHandler-/127.0.0.1:63083:Leader@506] - Commiting zxid 0x200000000 from /127.0.0.1:11228 not first!
          2011-11-03 13:07:49,717 [myid:2] - WARN [LearnerHandler-/127.0.0.1:63083:Leader@508] - First is 0

          Not sure why yet though.

          Show
          Camille Fournier added a comment - You have come to the same conclusion I did. The leader sends snapshots to the followers and they both log that they are taking a snaphsot (implying that we got the NEWLEADER message), but you don't see anything in the log that indicates the leader actually managed to call processAck on either of the responses, where you should see something like: 2011-11-03 13:07:49,717 [myid:2] - WARN [LearnerHandler-/127.0.0.1:63083:Leader@506] - Commiting zxid 0x200000000 from /127.0.0.1:11228 not first! 2011-11-03 13:07:49,717 [myid:2] - WARN [LearnerHandler-/127.0.0.1:63083:Leader@508] - First is 0 Not sure why yet though.
          Hide
          Camille Fournier added a comment -

          It would be nice if we could get a full stack dump on failure here. It seems like if this is some sort of a deadlock issue that would show it pretty quick. I will see if I can add that to QuorumPeerMainTest, to possibly get some more information.

          Show
          Camille Fournier added a comment - It would be nice if we could get a full stack dump on failure here. It seems like if this is some sort of a deadlock issue that would show it pretty quick. I will see if I can add that to QuorumPeerMainTest, to possibly get some more information.
          Hide
          Camille Fournier added a comment -

          tests that will print the whole stack trace in case of failure to reach client state

          Show
          Camille Fournier added a comment - tests that will print the whole stack trace in case of failure to reach client state
          Hide
          Camille Fournier added a comment -

          There's some extraneous stuff in ClientBase, but if anyone can repro this bug locally and run it with this stack tracing on that would be useful.

          Show
          Camille Fournier added a comment - There's some extraneous stuff in ClientBase, but if anyone can repro this bug locally and run it with this stack tracing on that would be useful.
          Hide
          Flavio Junqueira added a comment -

          I haven't been able to reproduce the problem, but I'll leave it running in a loop. On your comment about ClientBase, Camille, I was wondering if you have any other insight.

          Show
          Flavio Junqueira added a comment - I haven't been able to reproduce the problem, but I'll leave it running in a loop. On your comment about ClientBase, Camille, I was wondering if you have any other insight.
          Hide
          Camille Fournier added a comment -

          It seems to me that everything comes up ok, and starts the election process, elects a leader, and gets a snapshot from the leader. But in the logs where you have 2 followers very closely synched in time (never my local box but seems to happen on the build boxes occasionally), after the followers have claimed to write a snapshot to disk (which means they must have gotten the NEWLEADER message) the whole process then stops, and you see no logs from the leader indicating it ran processAck for either follower. It feels to me like it could be a race condition in the leader somewhere, causing it to somehow miss that ACK but I can't seem to find it. Nothing in the diffs from the checkin related to ZAB1.0 seem to be much of a culprit... I'm a bit stumped but going to keep looking.

          Show
          Camille Fournier added a comment - It seems to me that everything comes up ok, and starts the election process, elects a leader, and gets a snapshot from the leader. But in the logs where you have 2 followers very closely synched in time (never my local box but seems to happen on the build boxes occasionally), after the followers have claimed to write a snapshot to disk (which means they must have gotten the NEWLEADER message) the whole process then stops, and you see no logs from the leader indicating it ran processAck for either follower. It feels to me like it could be a race condition in the leader somewhere, causing it to somehow miss that ACK but I can't seem to find it. Nothing in the diffs from the checkin related to ZAB1.0 seem to be much of a culprit... I'm a bit stumped but going to keep looking.
          Hide
          Patrick Hunt added a comment -

          Camille - This patch is against trunk, is that true? (fails to apply against 3.4)

          Notice that the original traces are against 3.4 branch.

          I'm running this patch against my CI infra. will let you know.

          Show
          Patrick Hunt added a comment - Camille - This patch is against trunk, is that true? (fails to apply against 3.4) Notice that the original traces are against 3.4 branch. I'm running this patch against my CI infra. will let you know.
          Hide
          Patrick Hunt added a comment -

          here: (trunk)

          2011-11-03 17:03:55,228 [myid:] - ERROR [main:ClientBase@544] - Starting logAllStackTraces()
          Thread Sender-/127.0.0.1:48422
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
          	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:137)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.access$000(LearnerHandler.java:56)
          	at org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:425)
          Thread Finalizer
          	at java.lang.Object.wait(Native Method)
          	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
          	at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
          	at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
          Thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11227
          	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
          	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
          	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
          	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
          	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
          	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:194)
          	at java.lang.Thread.run(Thread.java:619)
          Thread QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11233
          	at java.lang.Thread.sleep(Native Method)
          	at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:361)
          	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:746)
          Thread main-EventThread
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
          	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
          	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:493)
          Thread SendWorker:0
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665)
          Thread SendWorker:0
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665)
          Thread Reference Handler
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Object.wait(Object.java:485)
          	at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
          Thread LearnerHandler-/127.0.0.1:48421
          	at java.lang.Object.wait(Native Method)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
          Thread Thread-20
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Thread.join(Thread.java:1143)
          	at java.lang.Thread.join(Thread.java:1196)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110)
          	at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:107)
          	at java.lang.Thread.run(Thread.java:619)
          Thread main-EventThread
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
          	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
          	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:493)
          Thread /127.0.0.1:11229
          	at java.net.PlainSocketImpl.socketAccept(Native Method)
          	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
          	at java.net.ServerSocket.implAccept(ServerSocket.java:453)
          	at java.net.ServerSocket.accept(ServerSocket.java:421)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:489)
          Thread SendWorker:1
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665)
          Thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233
          	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
          	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
          	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
          	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
          	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
          	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:194)
          	at java.lang.Thread.run(Thread.java:619)
          Thread RecvWorker:1
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745)
          Thread Thread-26
          	at java.net.PlainSocketImpl.socketAccept(Native Method)
          	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
          	at java.net.ServerSocket.implAccept(ServerSocket.java:453)
          	at java.net.ServerSocket.accept(ServerSocket.java:421)
          	at org.apache.zookeeper.server.quorum.Leader$LearnerCnxAcceptor.run(Leader.java:250)
          Thread /127.0.0.1:11235
          	at java.net.PlainSocketImpl.socketAccept(Native Method)
          	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
          	at java.net.ServerSocket.implAccept(ServerSocket.java:453)
          	at java.net.ServerSocket.accept(ServerSocket.java:421)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:489)
          Thread SendWorker:2
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665)
          Thread main-SendThread(localhost:11233)
          	at java.lang.Thread.sleep(Native Method)
          	at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:925)
          	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:979)
          Thread WorkerReceiver[myid=0]
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:881)
          	at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:205)
          	at java.lang.Thread.run(Thread.java:619)
          Thread /127.0.0.1:11232
          	at java.net.PlainSocketImpl.socketAccept(Native Method)
          	at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
          	at java.net.ServerSocket.implAccept(ServerSocket.java:453)
          	at java.net.ServerSocket.accept(ServerSocket.java:421)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:489)
          Thread WorkerSender[myid=2]
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
          	at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:362)
          	at java.lang.Thread.run(Thread.java:619)
          Thread SendWorker:2
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665)
          Thread RecvWorker:2
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745)
          Thread Thread-22
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Thread.join(Thread.java:1143)
          	at java.lang.Thread.join(Thread.java:1196)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110)
          	at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:107)
          	at java.lang.Thread.run(Thread.java:619)
          Thread SendWorker:1
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665)
          Thread RecvWorker:0
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745)
          Thread Signal Dispatcher
          Thread RecvWorker:0
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745)
          Thread RecvWorker:2
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745)
          Thread LearnerHandler-/127.0.0.1:48422
          	at java.lang.Object.wait(Native Method)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
          Thread WorkerReceiver[myid=2]
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:881)
          	at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:205)
          	at java.lang.Thread.run(Thread.java:619)
          Thread Thread-21
          	at java.lang.Object.wait(Native Method)
          	at java.lang.Thread.join(Thread.java:1143)
          	at java.lang.Thread.join(Thread.java:1196)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110)
          	at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:107)
          	at java.lang.Thread.run(Thread.java:619)
          Thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230
          	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
          	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
          	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
          	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
          	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
          	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:194)
          	at java.lang.Thread.run(Thread.java:619)
          Thread main
          	at java.lang.Thread.dumpThreads(Native Method)
          	at java.lang.Thread.getAllStackTraces(Thread.java:1487)
          	at org.apache.zookeeper.test.ClientBase.logAllStackTraces(ClientBase.java:536)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.waitForAll(QuorumPeerMainTest.java:319)
          	at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testEarlyLeaderAbandonment(QuorumPeerMainTest.java:148)
          	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
          	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
          	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
          	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
          	at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
          	at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:52)
          	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:69)
          	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:48)
          	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
          	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
          	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
          	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
          	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
          	at org.junit.runners.ParentRunner.run(ParentRunner.java:292)
          	at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39)
          	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422)
          	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931)
          	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785)
          Thread main-SendThread(localhost:11230)
          	at java.lang.Thread.sleep(Native Method)
          	at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:161)
          	at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1131)
          	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1071)
          Thread main-SendThread(localhost:11227)
          	at java.lang.Thread.sleep(Native Method)
          	at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:925)
          	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:979)
          Thread main-EventThread
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
          	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
          	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:493)
          Thread WorkerSender[myid=1]
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
          	at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:362)
          	at java.lang.Thread.run(Thread.java:619)
          Thread QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
          	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
          	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
          	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
          	at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
          	at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:363)
          	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:80)
          	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:733)
          Thread WorkerSender[myid=0]
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
          	at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:362)
          	at java.lang.Thread.run(Thread.java:619)
          Thread RecvWorker:1
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.net.SocketInputStream.read(SocketInputStream.java:182)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745)
          Thread Sender-/127.0.0.1:48421
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
          	at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:137)
          	at org.apache.zookeeper.server.quorum.LearnerHandler.access$000(LearnerHandler.java:56)
          	at org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:425)
          Thread QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227
          	at java.net.SocketInputStream.socketRead0(Native Method)
          	at java.net.SocketInputStream.read(SocketInputStream.java:129)
          	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
          	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
          	at java.io.DataInputStream.readInt(DataInputStream.java:370)
          	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
          	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
          	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
          	at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
          	at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:363)
          	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:80)
          	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:733)
          Thread WorkerReceiver[myid=1]
          	at sun.misc.Unsafe.park(Native Method)
          	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963)
          	at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
          	at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:881)
          	at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:205)
          	at java.lang.Thread.run(Thread.java:619)
          Ending logAllStackTraces()
          
          Show
          Patrick Hunt added a comment - here: (trunk) 2011-11-03 17:03:55,228 [myid:] - ERROR [main:ClientBase@544] - Starting logAllStackTraces() Thread Sender-/127.0.0.1:48422 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:137) at org.apache.zookeeper.server.quorum.LearnerHandler.access$000(LearnerHandler.java:56) at org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:425) Thread Finalizer at java.lang.Object.wait(Native Method) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118) at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159) Thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11227 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:194) at java.lang.Thread.run(Thread.java:619) Thread QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:11233 at java.lang.Thread.sleep(Native Method) at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:361) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:746) Thread main-EventThread at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:493) Thread SendWorker:0 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665) Thread SendWorker:0 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665) Thread Reference Handler at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:485) at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116) Thread LearnerHandler-/127.0.0.1:48421 at java.lang.Object.wait(Native Method) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450) Thread Thread-20 at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1143) at java.lang.Thread.join(Thread.java:1196) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110) at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:107) at java.lang.Thread.run(Thread.java:619) Thread main-EventThread at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:493) Thread /127.0.0.1:11229 at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:489) Thread SendWorker:1 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665) Thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11233 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:194) at java.lang.Thread.run(Thread.java:619) Thread RecvWorker:1 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.net.SocketInputStream.read(SocketInputStream.java:182) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745) Thread Thread-26 at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.apache.zookeeper.server.quorum.Leader$LearnerCnxAcceptor.run(Leader.java:250) Thread /127.0.0.1:11235 at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:489) Thread SendWorker:2 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665) Thread main-SendThread(localhost:11233) at java.lang.Thread.sleep(Native Method) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:925) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:979) Thread WorkerReceiver[myid=0] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:881) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:205) at java.lang.Thread.run(Thread.java:619) Thread /127.0.0.1:11232 at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390) at java.net.ServerSocket.implAccept(ServerSocket.java:453) at java.net.ServerSocket.accept(ServerSocket.java:421) at org.apache.zookeeper.server.quorum.QuorumCnxManager$Listener.run(QuorumCnxManager.java:489) Thread WorkerSender[myid=2] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:362) at java.lang.Thread.run(Thread.java:619) Thread SendWorker:2 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665) Thread RecvWorker:2 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.net.SocketInputStream.read(SocketInputStream.java:182) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745) Thread Thread-22 at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1143) at java.lang.Thread.join(Thread.java:1196) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110) at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:107) at java.lang.Thread.run(Thread.java:619) Thread SendWorker:1 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:829) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:60) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:665) Thread RecvWorker:0 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.net.SocketInputStream.read(SocketInputStream.java:182) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745) Thread Signal Dispatcher Thread RecvWorker:0 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.net.SocketInputStream.read(SocketInputStream.java:182) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745) Thread RecvWorker:2 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.net.SocketInputStream.read(SocketInputStream.java:182) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745) Thread LearnerHandler-/127.0.0.1:48422 at java.lang.Object.wait(Native Method) at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450) Thread WorkerReceiver[myid=2] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:881) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:205) at java.lang.Thread.run(Thread.java:619) Thread Thread-21 at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1143) at java.lang.Thread.join(Thread.java:1196) at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151) at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:110) at org.apache.zookeeper.server.quorum.QuorumPeerTestBase$MainThread.run(QuorumPeerTestBase.java:107) at java.lang.Thread.run(Thread.java:619) Thread NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11230 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method) at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210) at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65) at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69) at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:194) at java.lang.Thread.run(Thread.java:619) Thread main at java.lang.Thread.dumpThreads(Native Method) at java.lang.Thread.getAllStackTraces(Thread.java:1487) at org.apache.zookeeper.test.ClientBase.logAllStackTraces(ClientBase.java:536) at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.waitForAll(QuorumPeerMainTest.java:319) at org.apache.zookeeper.server.quorum.QuorumPeerMainTest.testEarlyLeaderAbandonment(QuorumPeerMainTest.java:148) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20) at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52) at org.junit.rules.TestWatchman$1.evaluate(TestWatchman.java:52) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:69) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:48) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222) at org.junit.runners.ParentRunner.run(ParentRunner.java:292) at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:39) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:422) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:931) at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:785) Thread main-SendThread(localhost:11230) at java.lang.Thread.sleep(Native Method) at org.apache.zookeeper.ClientCnxnSocketNIO.cleanup(ClientCnxnSocketNIO.java:161) at org.apache.zookeeper.ClientCnxn$SendThread.cleanup(ClientCnxn.java:1131) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1071) Thread main-SendThread(localhost:11227) at java.lang.Thread.sleep(Native Method) at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:925) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:979) Thread main-EventThread at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:493) Thread WorkerSender[myid=1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:362) at java.lang.Thread.run(Thread.java:619) Thread QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:363) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:80) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:733) Thread WorkerSender[myid=0] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:362) at java.lang.Thread.run(Thread.java:619) Thread RecvWorker:1 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.net.SocketInputStream.read(SocketInputStream.java:182) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:745) Thread Sender-/127.0.0.1:48421 at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) at org.apache.zookeeper.server.quorum.LearnerHandler.sendPackets(LearnerHandler.java:137) at org.apache.zookeeper.server.quorum.LearnerHandler.access$000(LearnerHandler.java:56) at org.apache.zookeeper.server.quorum.LearnerHandler$1.run(LearnerHandler.java:425) Thread QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227 at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:129) at java.io.BufferedInputStream.fill(BufferedInputStream.java:218) at java.io.BufferedInputStream.read(BufferedInputStream.java:237) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152) at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:363) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:80) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:733) Thread WorkerReceiver[myid=1] at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:1963) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollRecvQueue(QuorumCnxManager.java:881) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerReceiver.run(FastLeaderElection.java:205) at java.lang.Thread.run(Thread.java:619) Ending logAllStackTraces()
          Hide
          Camille Fournier added a comment -

          Looking at this some more I'm not entirely convinced it isn't a timing issue:

          I'm skeptical about it being a time issue because we wait 10 seconds for the waitForAll call to complete, but I'm not sure if this completely unrealistic or not assuming that the jenkins machine is overloaded.

          I actually have the startup and shutdown running in a loop on my box. The one time I managed to get it to fail was due to 10 seconds not being a long enough wait time. The servers were almost up, in fact, but election just took a little while as did snapshotting etc and it never succeeded.

          Show
          Camille Fournier added a comment - Looking at this some more I'm not entirely convinced it isn't a timing issue: I'm skeptical about it being a time issue because we wait 10 seconds for the waitForAll call to complete, but I'm not sure if this completely unrealistic or not assuming that the jenkins machine is overloaded. I actually have the startup and shutdown running in a loop on my box. The one time I managed to get it to fail was due to 10 seconds not being a long enough wait time. The servers were almost up, in fact, but election just took a little while as did snapshotting etc and it never succeeded.
          Hide
          Patrick Hunt added a comment -

          There's definitely a problem somewhere. See testEarlyLeaderAbandonment3.txt.gz

          Look at time 2011-11-04 03:41:24,779

          I modified the test to iterate 30 times (instead of 10) and sleep for 2 sec (rather than 1). I then ran this test continually overnight (300+times). It never failed, however when I looked at some of the console results I saw that it actually is failing, however the test is not catching this.

          So two issues really - 1) it looks like this is a real problem, from the exception it seems the leader has frozen? 2) the test needs to be updated regardless to fail in this case.

          Show
          Patrick Hunt added a comment - There's definitely a problem somewhere. See testEarlyLeaderAbandonment3.txt.gz Look at time 2011-11-04 03:41:24,779 I modified the test to iterate 30 times (instead of 10) and sleep for 2 sec (rather than 1). I then ran this test continually overnight (300+times). It never failed, however when I looked at some of the console results I saw that it actually is failing, however the test is not catching this. So two issues really - 1) it looks like this is a real problem, from the exception it seems the leader has frozen? 2) the test needs to be updated regardless to fail in this case.
          Hide
          Flavio Junqueira added a comment -

          Let me dump what I have in mind right now. After checking the logs and the stack trace Pat provided, it looks like the leader is waiting on the loop right before it sets the zookeeper server (Line 361, Leader), as we had observed before. It implies that the leader is not receiving acks from followers. Since the logs say that the followers are snapshotting, one option is that the snapshot is not concluding for some reason. Another alternative is that the acknowledgments to the leader are getting lost.

          Since Pat increased the parameter values to have a longer timeout, I'm even more skeptical that this is a timing issue, but I don't think we can rule it out yet.

          Show
          Flavio Junqueira added a comment - Let me dump what I have in mind right now. After checking the logs and the stack trace Pat provided, it looks like the leader is waiting on the loop right before it sets the zookeeper server (Line 361, Leader), as we had observed before. It implies that the leader is not receiving acks from followers. Since the logs say that the followers are snapshotting, one option is that the snapshot is not concluding for some reason. Another alternative is that the acknowledgments to the leader are getting lost. Since Pat increased the parameter values to have a longer timeout, I'm even more skeptical that this is a timing issue, but I don't think we can rule it out yet.
          Hide
          Patrick Hunt added a comment -

          If you update the patch with addl debugging, and fix the test to actually fail in this case, I'll be happy to run it again on my cluster.

          add some log detail when the followers complete snapshotting so we can rule this out/in.

          Show
          Patrick Hunt added a comment - If you update the patch with addl debugging, and fix the test to actually fail in this case, I'll be happy to run it again on my cluster. add some log detail when the followers complete snapshotting so we can rule this out/in.
          Hide
          Flavio Junqueira added a comment -

          This is the code snippet for the handling of UPTODATE in Learner.java:

                        case Leader.UPTODATE:
                              if (!snapshotTaken) {
                                  zk.takeSnapshot();
                              }
                              self.cnxnFactory.setZooKeeperServer(zk);                
                              break outerLoop;
          

          If the snapshot were completing, then it should be setting the zookeeper server. Given that we see the "Snapshotting" message in the logs, I assume that the follower is receiving an UPTODATE message, but it is failing to complete the snapshot.

          Show
          Flavio Junqueira added a comment - This is the code snippet for the handling of UPTODATE in Learner.java: case Leader.UPTODATE: if (!snapshotTaken) { zk.takeSnapshot(); } self.cnxnFactory.setZooKeeperServer(zk); break outerLoop; If the snapshot were completing, then it should be setting the zookeeper server. Given that we see the "Snapshotting" message in the logs, I assume that the follower is receiving an UPTODATE message, but it is failing to complete the snapshot.
          Hide
          Patrick Hunt added a comment -

          Ok, here's whats' happening wrt the test not failing.

          Look after the time index I provided, notice that the followers fail to hear from the leader, at which point they re-negotiate the quorum, at which point the quorum IS RE-EST SUCCESSFULLY, and the clients are now able to connect to the service and the test continues.

          This is a side-effect of increasing the "waitForAll" time that we allow. We are giving time for the quorum to re-establish itself after the "hang".

          Likely this is why we have not seen this in the past, if it did every happen (in any test) it might have been ignored due to recovery.

          so now that we know the test is really failing, perhaps we should just go back to 10 iterations/1sec sleep so that the test will actually fail. (or perhaps there's a better way to address this in the test?)

          Show
          Patrick Hunt added a comment - Ok, here's whats' happening wrt the test not failing. Look after the time index I provided, notice that the followers fail to hear from the leader, at which point they re-negotiate the quorum, at which point the quorum IS RE-EST SUCCESSFULLY, and the clients are now able to connect to the service and the test continues. This is a side-effect of increasing the "waitForAll" time that we allow. We are giving time for the quorum to re-establish itself after the "hang". Likely this is why we have not seen this in the past, if it did every happen (in any test) it might have been ignored due to recovery. so now that we know the test is really failing, perhaps we should just go back to 10 iterations/1sec sleep so that the test will actually fail. (or perhaps there's a better way to address this in the test?)
          Hide
          Flavio Junqueira added a comment -

          Here is some progress. I was actually looking at the wrong snippet. The correct one was the NEWLEADER handler:

                      case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
                              zk.takeSnapshot();
                              snapshotTaken = true;
                              writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
                              break;
                          }
          
          

          We also take a snapshot here and by looking at the stack trace that Pat posted, we see that the learner handlers are stuck in the loop right after receiving the ack, which essentially waits for the leader to start. By the same stack trace, the leader is not starting because it is waiting for the followers to acknowledge the NEWLEADER message... but the followers have acknowledged the NEWLEADER message, otherwise the learner handlers wouldn't be executing that loop (Line 450). Unless I'm missing anything, the problem must be in Leader.processAck.

          Show
          Flavio Junqueira added a comment - Here is some progress. I was actually looking at the wrong snippet. The correct one was the NEWLEADER handler: case Leader.NEWLEADER: // it will be NEWLEADER in v1.0 zk.takeSnapshot(); snapshotTaken = true; writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true); break; } We also take a snapshot here and by looking at the stack trace that Pat posted, we see that the learner handlers are stuck in the loop right after receiving the ack, which essentially waits for the leader to start. By the same stack trace, the leader is not starting because it is waiting for the followers to acknowledge the NEWLEADER message... but the followers have acknowledged the NEWLEADER message, otherwise the learner handlers wouldn't be executing that loop (Line 450). Unless I'm missing anything, the problem must be in Leader.processAck.
          Hide
          Flavio Junqueira added a comment -

          Here are the two cases that would cause the ack not to be added without generating any log message:

                  if (outstandingProposals.size() == 0) {
                      if (LOG.isDebugEnabled()) {
                          LOG.debug("outstanding is 0");
                      }
                      return;
                  }
                  if (lastCommitted >= zxid) {
                      if (LOG.isDebugEnabled()) {
                          LOG.debug("proposal has already been committed, pzxid:"
                                  + lastCommitted
                                  + " zxid: 0x" + Long.toHexString(zxid));
                      }
                      // The proposal has already been committed
                      return;
                  }
          
          

          I'm trying to decide which one is likely to be causing it to return.

          Show
          Flavio Junqueira added a comment - Here are the two cases that would cause the ack not to be added without generating any log message: if (outstandingProposals.size() == 0) { if (LOG.isDebugEnabled()) { LOG.debug("outstanding is 0"); } return; } if (lastCommitted >= zxid) { if (LOG.isDebugEnabled()) { LOG.debug("proposal has already been committed, pzxid:" + lastCommitted + " zxid: 0x" + Long.toHexString(zxid)); } // The proposal has already been committed return; } I'm trying to decide which one is likely to be causing it to return.
          Hide
          Flavio Junqueira added a comment -

          My latest conjecture is that we are getting acks for NEWLEADER before we have a chance to get it into the outstandingProposals queue, which would cause the ack to be dropped and the behavior we are observing. I tried to move the code in lead that adds NEWLEADER to the queue to before the point where we start the cnxAcceptor, but it broke a lot of tests, so I don't think this is the right approach.

          Pat, if you want to run again, I would suggest to put two info messages for the cases I posted last in Leader.processAck() to determine exactly which case is being triggered and preventing the leader from collecting the ack.

          Show
          Flavio Junqueira added a comment - My latest conjecture is that we are getting acks for NEWLEADER before we have a chance to get it into the outstandingProposals queue, which would cause the ack to be dropped and the behavior we are observing. I tried to move the code in lead that adds NEWLEADER to the queue to before the point where we start the cnxAcceptor, but it broke a lot of tests, so I don't think this is the right approach. Pat, if you want to run again, I would suggest to put two info messages for the cases I posted last in Leader.processAck() to determine exactly which case is being triggered and preventing the leader from collecting the ack.
          Hide
          Patrick Hunt added a comment -

          I'm running that on my cluster now. will report results.

          Show
          Patrick Hunt added a comment - I'm running that on my cluster now. will report results.
          Hide
          Patrick Hunt added a comment -

          This was printed a few times just before the failure:

              [junit] 2011-11-04 13:23:20,001 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:51057:Leader@478] - PDH outstandingProposals.size() is ZERO
          

          See attachment testEarlyLeaderAbandonment4

          Show
          Patrick Hunt added a comment - This was printed a few times just before the failure: [junit] 2011-11-04 13:23:20,001 [myid:2] - INFO [LearnerHandler-/127.0.0.1:51057:Leader@478] - PDH outstandingProposals.size() is ZERO See attachment testEarlyLeaderAbandonment4
          Hide
          Patrick Hunt added a comment -

          I'm not sure that's conclusive though, this is the patch, if I apply it on my local system the test passes, but the logs still contain this ZERO outstanding message:

          diff --git a/src/java/main/org/apache/zookeeper/server/quorum/Leader.java b/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          index a44f1ee..a4cc127 100644
          --- a/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          +++ b/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          @@ -475,12 +475,14 @@ public class Leader {
                   }
           
                   if (outstandingProposals.size() == 0) {
          +            LOG.info("PDH outstandingProposals.size() is ZERO");
                       if (LOG.isDebugEnabled()) {
                           LOG.debug("outstanding is 0");
                       }
                       return;
                   }
                   if (lastCommitted >= zxid) {
          +            LOG.info("PDH lastCommitted = " + Long.toHexString(lastCommitted) + " zxid = " + Long.toHexString(zxid));
                       if (LOG.isDebugEnabled()) {
                           LOG.debug("proposal has already been committed, pzxid:"
                                   + lastCommitted
          
          Show
          Patrick Hunt added a comment - I'm not sure that's conclusive though, this is the patch, if I apply it on my local system the test passes, but the logs still contain this ZERO outstanding message: diff --git a/src/java/main/org/apache/zookeeper/server/quorum/Leader.java b/src/java/main/org/apache/zookeeper/server/quorum/Leader.java index a44f1ee..a4cc127 100644 --- a/src/java/main/org/apache/zookeeper/server/quorum/Leader.java +++ b/src/java/main/org/apache/zookeeper/server/quorum/Leader.java @@ -475,12 +475,14 @@ public class Leader { } if (outstandingProposals.size() == 0) { + LOG.info("PDH outstandingProposals.size() is ZERO"); if (LOG.isDebugEnabled()) { LOG.debug("outstanding is 0"); } return; } if (lastCommitted >= zxid) { + LOG.info("PDH lastCommitted = " + Long.toHexString(lastCommitted) + " zxid = " + Long.toHexString(zxid)); if (LOG.isDebugEnabled()) { LOG.debug("proposal has already been committed, pzxid:" + lastCommitted
          Hide
          Flavio Junqueira added a comment -

          It is ok to have the zero outstanding message, since the request might have been already acknowledged by a quorum. It is a problem if you get it and it hasn't been acknowledged by a quorum, which corresponds to the case I mentioned above.

          Show
          Flavio Junqueira added a comment - It is ok to have the zero outstanding message, since the request might have been already acknowledged by a quorum. It is a problem if you get it and it hasn't been acknowledged by a quorum, which corresponds to the case I mentioned above.
          Hide
          Patrick Hunt added a comment -

          why do I see this in my local test log then? (passes)

          2011-11-04 13:33:12,979 [myid:0] - INFO  [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227:FileTxnSnapLog@255] - Snapshotting: 0
          2011-11-04 13:33:12,979 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:43739:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 1
          2011-11-04 13:33:12,981 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:43740:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 0
          2011-11-04 13:33:12,985 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:43740:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 0
          <then immediately a client connects successfully>
          

          This is at the same point in the log where we see the crash (but this is a passing version)

          Show
          Patrick Hunt added a comment - why do I see this in my local test log then? (passes) 2011-11-04 13:33:12,979 [myid:0] - INFO [QuorumPeer[myid=0]/0:0:0:0:0:0:0:0:11227:FileTxnSnapLog@255] - Snapshotting: 0 2011-11-04 13:33:12,979 [myid:2] - INFO [LearnerHandler-/127.0.0.1:43739:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 1 2011-11-04 13:33:12,981 [myid:2] - INFO [LearnerHandler-/127.0.0.1:43740:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 0 2011-11-04 13:33:12,985 [myid:2] - INFO [LearnerHandler-/127.0.0.1:43740:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 0 <then immediately a client connects successfully> This is at the same point in the log where we see the crash (but this is a passing version)
          Hide
          Patrick Hunt added a comment -

          here's more log information for a passing case. I added a new LOG after the size==0 check to check for the case where we get an ack and there is something outstanding.

          Notice each follower is sending 2 acks. ack!

          2011-11-04 14:35:43,549 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230:FileTxnSnapLog@255] - Snapshotting: 0
          2011-11-04 14:35:43,549 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:45594:Leader@484] - PDH outstandingProposals.size() is NOT zero zxid = 100000000 sid = 0 followerAddr = /127.0.0.1:11234
          2011-11-04 14:35:43,549 [myid:2] - WARN  [LearnerHandler-/127.0.0.1:45594:Leader@509] - Commiting zxid 0x100000000 from /127.0.0.1:11234 not first!
          2011-11-04 14:35:43,550 [myid:2] - WARN  [LearnerHandler-/127.0.0.1:45594:Leader@511] - First is 0
          2011-11-04 14:35:43,550 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:45594:Leader@535] - Have quorum of supporters; starting up and setting last processed zxid: 4294967296
          2011-11-04 14:35:43,566 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:45595:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 1 followerAddr = /127.0.0.1:11234
          2011-11-04 14:35:43,569 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:45594:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 0 followerAddr = /127.0.0.1:11234
          2011-11-04 14:35:43,570 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:45595:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 1 followerAddr = /127.0.0.1:11234
          
          Show
          Patrick Hunt added a comment - here's more log information for a passing case. I added a new LOG after the size==0 check to check for the case where we get an ack and there is something outstanding. Notice each follower is sending 2 acks. ack! 2011-11-04 14:35:43,549 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:11230:FileTxnSnapLog@255] - Snapshotting: 0 2011-11-04 14:35:43,549 [myid:2] - INFO [LearnerHandler-/127.0.0.1:45594:Leader@484] - PDH outstandingProposals.size() is NOT zero zxid = 100000000 sid = 0 followerAddr = /127.0.0.1:11234 2011-11-04 14:35:43,549 [myid:2] - WARN [LearnerHandler-/127.0.0.1:45594:Leader@509] - Commiting zxid 0x100000000 from /127.0.0.1:11234 not first! 2011-11-04 14:35:43,550 [myid:2] - WARN [LearnerHandler-/127.0.0.1:45594:Leader@511] - First is 0 2011-11-04 14:35:43,550 [myid:2] - INFO [LearnerHandler-/127.0.0.1:45594:Leader@535] - Have quorum of supporters; starting up and setting last processed zxid: 4294967296 2011-11-04 14:35:43,566 [myid:2] - INFO [LearnerHandler-/127.0.0.1:45595:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 1 followerAddr = /127.0.0.1:11234 2011-11-04 14:35:43,569 [myid:2] - INFO [LearnerHandler-/127.0.0.1:45594:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 0 followerAddr = /127.0.0.1:11234 2011-11-04 14:35:43,570 [myid:2] - INFO [LearnerHandler-/127.0.0.1:45595:Leader@478] - PDH outstandingProposals.size() is ZERO zxid = 100000000 sid = 1 followerAddr = /127.0.0.1:11234
          Hide
          Camille Fournier added a comment -

          2 acks is expected. This threw me the first time I saw it in the code, but it's right as far as I could reason looking through follower and leader, the first ack is after NEWLEADER, the second is right before we start the zk server.

          Show
          Camille Fournier added a comment - 2 acks is expected. This threw me the first time I saw it in the code, but it's right as far as I could reason looking through follower and leader, the first ack is after NEWLEADER, the second is right before we start the zk server.
          Hide
          Flavio Junqueira added a comment -

          With this patch, we make sure that the newleader message is queued before we process acks to newleader.

          Show
          Flavio Junqueira added a comment - With this patch, we make sure that the newleader message is queued before we process acks to newleader.
          Hide
          Flavio Junqueira added a comment -

          Camille, It sounds right that we need multiple acks from the follower, but it sounds awkward that they are acknowledging the same message. Perhaps this is to guarantee a correct implementation and for backward compatibility?

          Show
          Flavio Junqueira added a comment - Camille, It sounds right that we need multiple acks from the follower, but it sounds awkward that they are acknowledging the same message. Perhaps this is to guarantee a correct implementation and for backward compatibility?
          Hide
          Patrick Hunt added a comment -

          it's running in my test harness now (both camille's patch and fpj)

          Show
          Patrick Hunt added a comment - it's running in my test harness now (both camille's patch and fpj)
          Hide
          Alexander Shraer added a comment -

          Please note that https://issues.apache.org/jira/browse/ZOOKEEPER-1194 fixes the problem of NEWLEADER message not being in outstandingProposals when NEWLEADER acks arrive.

          I haven't followed ZK-1270 closely, but just thought you should be aware of ZK-1194

          Show
          Alexander Shraer added a comment - Please note that https://issues.apache.org/jira/browse/ZOOKEEPER-1194 fixes the problem of NEWLEADER message not being in outstandingProposals when NEWLEADER acks arrive. I haven't followed ZK-1270 closely, but just thought you should be aware of ZK-1194
          Hide
          Patrick Hunt added a comment -

          Thanks Alexander. How does that patch compare to this? Esp wrt risk?

          Do you think we should choose 1194 over this change?

          Flavio?

          Show
          Patrick Hunt added a comment - Thanks Alexander. How does that patch compare to this? Esp wrt risk? Do you think we should choose 1194 over this change? Flavio?
          Hide
          Alexander Shraer added a comment -

          I think that ZK-1194 risk is very low. It modifies 2 lines of code in Leader.java to make sure that the leader is in connectingFollowers (electingFollowers) before returning from getEpochToPropose (waitForEpochAck). It includes 2 tests for these two changes.

          According to my understanding, ZK-1194 patch can replace all changes to Leader and LearnerHandler that appear in the patches above for ZK-1270. Specifically with ZK-1194 in it is not necessary to explicitly check that NEWLEADER message got into the outstandingProposals, and to wait for this condition in LeaderHandler.

          I'm not sure what was the purpose of the changes to the other files in ZK-1270 but I'm pretty sure ZK-1194 addresses the same issue as the changes to Leader and LearnerHandler made here. Also - ZK-1194 handles another possible race condition not addressed in ZK-1270.

          Show
          Alexander Shraer added a comment - I think that ZK-1194 risk is very low. It modifies 2 lines of code in Leader.java to make sure that the leader is in connectingFollowers (electingFollowers) before returning from getEpochToPropose (waitForEpochAck). It includes 2 tests for these two changes. According to my understanding, ZK-1194 patch can replace all changes to Leader and LearnerHandler that appear in the patches above for ZK-1270. Specifically with ZK-1194 in it is not necessary to explicitly check that NEWLEADER message got into the outstandingProposals, and to wait for this condition in LeaderHandler. I'm not sure what was the purpose of the changes to the other files in ZK-1270 but I'm pretty sure ZK-1194 addresses the same issue as the changes to Leader and LearnerHandler made here. Also - ZK-1194 handles another possible race condition not addressed in ZK-1270.
          Hide
          Mahadev konar added a comment -

          Alex/Pat,
          I'd really prefer the patch on this jira for now. Alex, I am in the middle of making a release. The patch on ZOOKEEPER-1194 does not apply and I really would not like to wait for consensus on the fix on ZOOKEEEPR-1194. We can definitely put the fix in ZOOKEEPER-1194 in 3.5 or 3.4.1. Fair enough?

          Show
          Mahadev konar added a comment - Alex/Pat, I'd really prefer the patch on this jira for now. Alex, I am in the middle of making a release. The patch on ZOOKEEPER-1194 does not apply and I really would not like to wait for consensus on the fix on ZOOKEEEPR-1194. We can definitely put the fix in ZOOKEEPER-1194 in 3.5 or 3.4.1. Fair enough?
          Hide
          Alexander Shraer added a comment -

          Hi Mahadev,

          I understand, I'm only speaking up here because the patches overlap. When ZK-1194 goes in the changes to Leader and LearnerHandler from ZK-1270 become unnecessary.

          Actually, now that I looked closer on ZK-1270 I realized that the changes to Leader.java and LearnerHandler.java are the only code changes here and the rest are tests. So ZK-1270 is subsumed by ZK-1194.

          I think it would be best to adopt the code changes from ZK-1194 (changes to 2 lines of code) and merge the tests in these patches. But this is up to you of course.

          Alex

          Show
          Alexander Shraer added a comment - Hi Mahadev, I understand, I'm only speaking up here because the patches overlap. When ZK-1194 goes in the changes to Leader and LearnerHandler from ZK-1270 become unnecessary. Actually, now that I looked closer on ZK-1270 I realized that the changes to Leader.java and LearnerHandler.java are the only code changes here and the rest are tests. So ZK-1270 is subsumed by ZK-1194. I think it would be best to adopt the code changes from ZK-1194 (changes to 2 lines of code) and merge the tests in these patches. But this is up to you of course. Alex
          Hide
          Alexander Shraer added a comment -

          I'm attaching an alternative patch to this.

          Show
          Alexander Shraer added a comment - I'm attaching an alternative patch to this.
          Hide
          Mahadev konar added a comment -

          Alex,
          Can you please upload a patch that applies to trunk and 3.4 branch here? I'd like to get this done tonight.

          Show
          Mahadev konar added a comment - Alex, Can you please upload a patch that applies to trunk and 3.4 branch here? I'd like to get this done tonight.
          Hide
          Mahadev konar added a comment -

          Alex,
          Please make sure that you grant code changes to Apache. You just have to click on the box that says "Grant license to Apache" when attaching the patch.

          Do reattach the patch with the grant. thanks

          Show
          Mahadev konar added a comment - Alex, Please make sure that you grant code changes to Apache. You just have to click on the box that says "Grant license to Apache" when attaching the patch. Do reattach the patch with the grant. thanks
          Hide
          Camille Fournier added a comment -

          If readyToStart becomes unused with this patch can we please go ahead and remove it?

          Show
          Camille Fournier added a comment - If readyToStart becomes unused with this patch can we please go ahead and remove it?
          Hide
          Patrick Hunt added a comment -

          note: flavio's patch is causing the following 11 tests to fail consistently:

          org.apache.zookeeper.test.DisconnectedWatcherTest.testChildWatcherAutoResetWithChroot
          org.apache.zookeeper.test.DisconnectedWatcherTest.testDefaultWatcherAutoResetWithChroot
          org.apache.zookeeper.test.DisconnectedWatcherTest.testDeepChildWatcherAutoResetWithChroot
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithGlobal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithLocal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithGlobal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithLocal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithGlobal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithLocal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithGlobal
          org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithLocal
          
          Show
          Patrick Hunt added a comment - note: flavio's patch is causing the following 11 tests to fail consistently: org.apache.zookeeper.test.DisconnectedWatcherTest.testChildWatcherAutoResetWithChroot org.apache.zookeeper.test.DisconnectedWatcherTest.testDefaultWatcherAutoResetWithChroot org.apache.zookeeper.test.DisconnectedWatcherTest.testDeepChildWatcherAutoResetWithChroot org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithGlobal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithLocal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithGlobal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithLocal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithGlobal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetWithLocal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithGlobal org.apache.zookeeper.test.WatcherTest.testWatcherAutoResetDisabledWithLocal
          Hide
          Patrick Hunt added a comment -

          Note: these same tests are failing with the 1270-1194 patch.

          Show
          Patrick Hunt added a comment - Note: these same tests are failing with the 1270-1194 patch.
          Hide
          Alexander Shraer added a comment -

          Camille, we should remove readyToStart, but it is still used in Zab1_0Test, which uses it to check that the leader is ready and able to receive connections. We can most probably check this some other way, but I suggest to leave this for now and fix later.

          Show
          Alexander Shraer added a comment - Camille, we should remove readyToStart, but it is still used in Zab1_0Test, which uses it to check that the leader is ready and able to receive connections. We can most probably check this some other way, but I suggest to leave this for now and fix later.
          Hide
          Mahadev konar added a comment -

          +1 on Alex's suggestion. Lets stick to minimal changes for now .

          Show
          Mahadev konar added a comment - +1 on Alex's suggestion. Lets stick to minimal changes for now .
          Hide
          Camille Fournier added a comment -

          without my clientbase junk

          Show
          Camille Fournier added a comment - without my clientbase junk
          Hide
          Camille Fournier added a comment -

          branch 3.4 version of clean patch

          Show
          Camille Fournier added a comment - branch 3.4 version of clean patch
          Hide
          Mahadev konar added a comment -

          Just committed Camille's final patch to trunk and 3.4. Thanks to Camille/Flavio/Alex/Pat on this! A true community patch .

          Show
          Mahadev konar added a comment - Just committed Camille's final patch to trunk and 3.4. Thanks to Camille/Flavio/Alex/Pat on this! A true community patch .
          Hide
          Flavio Junqueira added a comment -

          Sorry for missing part of the fun.

          Alex, Thanks for spotting the duplicate, that was a great catch. I'm not convinced that your patch completely solves the problem, though. LearnerHandler sends NEWLEADER concurrently with the leader adding NEWLEADER to outstandingRequests, so even fixing the barriers as you did does not prevent the race I mentioned above, I think.

          I must say that I didn't find my proposed patch very elegant, but I'm not entirely sure that yours covers all cases, so let me know what you think.

          Show
          Flavio Junqueira added a comment - Sorry for missing part of the fun. Alex, Thanks for spotting the duplicate, that was a great catch. I'm not convinced that your patch completely solves the problem, though. LearnerHandler sends NEWLEADER concurrently with the leader adding NEWLEADER to outstandingRequests, so even fixing the barriers as you did does not prevent the race I mentioned above, I think. I must say that I didn't find my proposed patch very elegant, but I'm not entirely sure that yours covers all cases, so let me know what you think.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #1356 (See https://builds.apache.org/job/ZooKeeper-trunk/1356/)
          ZOOKEEPER-1270. testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. (Flavio, Camille and Alexander Shraer via mahadev)

          mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197891
          Files :

          • /zookeeper/trunk/CHANGES.txt
          • /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java
          • /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java
          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #1356 (See https://builds.apache.org/job/ZooKeeper-trunk/1356/ ) ZOOKEEPER-1270 . testEarlyLeaderAbandonment failing intermittently, quorum formed, no serving. (Flavio, Camille and Alexander Shraer via mahadev) mahadev : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1197891 Files : /zookeeper/trunk/CHANGES.txt /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/quorum/Leader.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/QuorumPeerMainTest.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/server/quorum/Zab1_0Test.java /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/ClientBase.java
          Hide
          Alexander Shraer added a comment -

          Hi Flavio,

          LearnerHandler sends NEWLEADER messages after running waitForEpochAck.
          The patch makes sure that waitForEpochAck does not return until the leader also runs it. Leader runs waitForEpochAck after inserting NEWLEADER message into the outstandingProposals.

          Alex

          Show
          Alexander Shraer added a comment - Hi Flavio, LearnerHandler sends NEWLEADER messages after running waitForEpochAck. The patch makes sure that waitForEpochAck does not return until the leader also runs it. Leader runs waitForEpochAck after inserting NEWLEADER message into the outstandingProposals. Alex
          Hide
          Flavio Junqueira added a comment -

          Sounds right, thanks for the clarification.

          Show
          Flavio Junqueira added a comment - Sounds right, thanks for the clarification.
          Hide
          Patrick Hunt added a comment -

          this is still failing on 3.4 branch (incl this latest patch).

          See testEarlyLeaderAbandonment5.txt.gz (attached).

          Notice now that the stat is continually run, without the clients attempting to connect to the server. The test doesn't seem to be cutting over from detecting the service is up to starting the clients.

          Show
          Patrick Hunt added a comment - this is still failing on 3.4 branch (incl this latest patch). See testEarlyLeaderAbandonment5.txt.gz (attached). Notice now that the stat is continually run, without the clients attempting to connect to the server. The test doesn't seem to be cutting over from detecting the service is up to starting the clients.

            People

            • Assignee:
              Flavio Junqueira
              Reporter:
              Patrick Hunt
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development