Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Invalid
-
3.4.5
-
None
-
None
-
None
Description
I tried looking through existing JIRA for something like this, but the closest I came was ZOOKEEPER-2104. It looks very similar, but I don't know if it really is the same thing. Essentially we had a 5 node ensemble for a large storm cluster. For a few of the nodes at the same time they get an error that looks like.
WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@762] - Connection broken for id 2, my id = 4, error = java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:747) WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@765] - Interrupting SendWorker WARN [SendWorker:2:QuorumCnxManager$SendWorker@679] - Interrupted while waiting for message on queue java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095) at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389) at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:831) at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:62) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:667) WARN [SendWorker:2:QuorumCnxManager$SendWorker@688] - Send worker leaving thread WARN [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@89] - Exception when following the leader java.net.SocketException: Connection reset at java.net.SocketInputStream.read(SocketInputStream.java:189) at java.net.SocketInputStream.read(SocketInputStream.java:121) at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) at java.io.BufferedInputStream.read(BufferedInputStream.java:254) at java.io.DataInputStream.readInt(DataInputStream.java:387) at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63) at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83) at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108) at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152) at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740) INFO [QuorumPeer[myid=4]/0.0.0.0:50512:Follower@166] - shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
After that all of the connections are shut down
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client ...
but it does not manage to have the JVM shut down
INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerZooKeeperServer@139] - Shutting down INFO [QuorumPeer[myid=4]/0.0.0.0:50512:ZooKeeperServer@419] - shutting down INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FollowerRequestProcessor@105] - Shutting down INFO [QuorumPeer[myid=4]/0.0.0.0:50512:CommitProcessor@181] - Shutting down INFO [FollowerRequestProcessor:4:FollowerRequestProcessor@95] - FollowerRequestProcessor exited loop! INFO [QuorumPeer[myid=4]/0.0.0.0:50512:FinalRequestProcessor@415] - shutdown of request processor complete INFO [CommitProcessor:4:CommitProcessor@150] - CommitProcessor exited loop! WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:50512:NIOServerCnxn@1001] - Closed socket connection for client /... (no session established for client) INFO [QuorumPeer[myid=4]/0.0.0.0:50512:SyncRequestProcessor@175] - Shutting down INFO [SyncThread:4:SyncRequestProcessor@155] - SyncRequestProcessor exited! INFO [QuorumPeer[myid=4]/0.0.0.0:50512:QuorumPeer@670] - LOOKING
after that all connections to that node initiate, and then are shut down with ZooKeeperServer not running. It seems to stay in this state indefinitely until the process is manually restarted. After that it recovers.
We have seen this happen on multiple servers at the same time resulting in the entire ensemble being unusable.