Uploaded image for project: 'ZooKeeper'
  1. ZooKeeper
  2. ZOOKEEPER-2766

Quorum fails with java.io.EOFException

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 3.5.3
    • None
    • leaderElection, quorum
    • None
    • CentOS-7, Docker version 17.03.1-ce

    Description

      When I start a ZooKeeper ensemble comprising 3 nodes, I'm currently facing the following behavior:
      Two nodes (let's say node 2 and 3) out of the three start their own quorum, and finally one of them is elected the new leader (node 3), while the other one becomes the follower (node 2). Node 1 seems to be able to establish a connection to node 3 (elected leader) in my case, but this seems to fail for node 2.
      Node 1 shows the following in its logs:

      2017-04-25 09:24:02,806 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1055] - LOOKING
      2017-04-25 09:24:02,808 [myid:1] - INFO [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FastLeaderElection@894] - New election. My id = 1, proposed zxid=0x0
      2017-04-25 09:24:02,811 [myid:1] - INFO [WorkerReceiver[myid=1]:FastLeaderElection@688] - Notification: 2 (message format version), 1 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 1 (n.sid), 0x0 (
      n.peerEPoch), LOOKING (my state)0 (n.config version)
      2017-04-25 09:24:02,817 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager@457] - Cannot open channel to 2 at election address /9.152.171.98:3888
      java.net.ConnectException: Connection refused (Connection refused)
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
      at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
      at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      at java.net.Socket.connect(Socket.java:589)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:443)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:486)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:421)
      at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:486)
      at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:465)
      at java.lang.Thread.run(Thread.java:745)
      2017-04-25 09:24:02,822 [myid:1] - INFO [WorkerSender[myid=1]:QuorumCnxManager@278] - Have smaller server identifier, so dropping the connection: (3, 1)
      2017-04-25 09:24:03,025 [myid:1] - WARN [QuorumPeer[myid=1](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@457] - Cannot open channel to 2 at election address /9.152.171.98:3888

      However, that's not all, since the quorum consisting of node 2 and 3 does not work properly. The nodes' logs tell that leader election between these two works fine.
      Here's what node 3 (leader) says:

      2017-04-25 09:09:33,842 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 3 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
      2017-04-25 09:09:33,844 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 2 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
      2017-04-25 09:09:33,851 [myid:3] - INFO [WorkerReceiver[myid=3]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (n.peerEPoch), LOOKING (my state)0 (n.config version)
      2017-04-25 09:09:34,051 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id3,name1=replica.3,name2=LeaderElection]
      2017-04-25 09:09:34,052 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1143] - LEADING
      2017-04-25 09:09:34,055 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@63] - TCP NoDelay set to: true
      2017-04-25 09:09:34,055 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@83] - zookeeper.leader.maxConcurrentSnapshots = 10
      2017-04-25 09:09:34,056 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@85] - zookeeper.leader.maxConcurrentSnapshotTimeout = 5

      And here's the output node 2 (follower) provides:

      2017-04-25 09:09:31,875 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@688] - Notification: 2 (message format version), 3 (n.leader), 0x0 (n.zxid), 0x1 (n.round), LOOKING (n.state), 2 (n.sid), 0x0 (
      n.peerEPoch), LOOKING (my state)0 (n.config version)
      2017-04-25 09:09:32,077 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):MBeanRegistry@128] - Unregister MBean [org.apache.ZooKeeperService:name0=ReplicatedServer_id2,name1=repl
      ica.2,name2=LeaderElection]
      2017-04-25 09:09:32,077 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1131] - FOLLOWING
      2017-04-25 09:09:32,082 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@88] - TCP NoDelay set to: true

      So far, so good. But seconds later the connection between node 2 and 3 seems to get lost, causing leader node 3 to report an EOFExeption. If I understand the logs correctly, node 2 (follower) properly closes the connection (sending "Goodbye"), whilst node 3 says that the socket is still open.

      2017-04-25 09:09:34,190 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@414] - LEADING - LEADER ELECTION TOOK - 138 MS
      2017-04-25 09:09:34,197 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):FileTxnSnapLog@320] - Snapshotting: 0x0 to /data/version-2/snapshot.0
      2017-04-25 09:09:35,076 [myid:3] - INFO [LearnerHandler-/9.152.171.98:51328:LearnerHandler@382] - Follower sid: 2 : info : 9.152.171.98:2888:3888:participant;0.0.0.0:2181
      2017-04-25 09:09:35,113 [myid:3] - INFO [LearnerHandler-/9.152.171.98:51328:LearnerHandler@683] - Synchronizing with Follower sid: 2 maxCommittedLog=0x0 minCommittedLog=0x0 lastProcessedZxid=0x0 peerLastZxid=0x
      0
      2017-04-25 09:09:35,114 [myid:3] - INFO [LearnerHandler-/9.152.171.98:51328:LearnerHandler@727] - Sending DIFF zxid=0x0 for peer sid: 2
      2017-04-25 09:09:35,133 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@1258] - Have quorum of supporters, sids: [ [2, 3],[2, 3] ]; starting up and setting last processed zxid: 0x100000000
      2017-04-25 09:09:35,169 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):CommitProcessor@255] - Configuring CommitProcessor with 2 worker threads.
      2017-04-25 09:09:35,179 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):ContainerManager@64] - Using checkIntervalMs=60000 maxPerMinute=10000
      2017-04-25 09:09:35,196 [myid:3] - WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@919] - Connection broken for id 2, my id = 3, error =
      java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:392)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)
      2017-04-25 09:09:35,196 [myid:3] - WARN [RecvWorker:2:QuorumCnxManager$RecvWorker@922] - Interrupting SendWorker
      2017-04-25 09:09:35,197 [myid:3] - WARN [SendWorker:2:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for message on queue
      java.lang.InterruptedException
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
      at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
      2017-04-25 09:09:35,197 [myid:3] - WARN [SendWorker:2:QuorumCnxManager$SendWorker@845] - Send worker leaving thread id 2 my id = 3
      2017-04-25 09:09:35,204 [myid:3] - ERROR [LearnerHandler-/9.152.171.98:51328:LearnerHandler@604] - Unexpected exception causing shutdown while sock still open
      java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:392)
      at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
      at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
      at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
      at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:515)
      2017-04-25 09:09:35,204 [myid:3] - WARN [LearnerHandler-/9.152.171.98:51328:LearnerHandler@619] - ******* GOODBYE /9.152.171.98:51328 ********
      2017-04-25 09:09:37,181 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@626] - Shutting down
      2017-04-25 09:09:37,182 [myid:3] - INFO [QuorumPeer[myid=3](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Leader@632] - Shutdown called
      java.lang.Exception: shutdown Leader! reason: Not sufficient followers synced, only synced with sids: [ [3] ]
      at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:632)
      at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:612)
      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1146)

      Unfortunately, node 2 does not provide any additional information on what exactly is going on. After leader election, the only thing it reports is this:

      2017-04-25 09:09:32,091 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Follower@68] - FOLLOWING - LEADER ELECTION TOOK - 13 MS
      2017-04-25 09:09:32,094 [myid:2] - WARN [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@273] - Unexpected exception, tries=0, remaining init limit=9999, connecting to /9.152.171.12:288
      8
      java.net.ConnectException: Connection refused (Connection refused)
      at java.net.PlainSocketImpl.socketConnect(Native Method)
      at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
      at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
      at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
      at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
      at java.net.Socket.connect(Socket.java:589)
      at org.apache.zookeeper.server.quorum.Learner.sockConnect(Learner.java:227)
      at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:256)
      at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:76)
      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1133)
      2017-04-25 09:09:33,142 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@369] - Getting a diff from the leader 0x0
      2017-04-25 09:09:33,146 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@516] - Learner received NEWLEADER message
      2017-04-25 09:09:33,207 [myid:2] - INFO [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):Learner@499] - Learner received UPTODATE message
      2017-04-25 09:09:33,220 [myid:2] - WARN [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumPeer@1446] - Restarting Leader Election
      2017-04-25 09:09:33,221 [myid:2] - INFO [/0.0.0.0:3888:QuorumCnxManager$Listener@665] - Leaving listener
      2017-04-25 09:09:33,222 [myid:2] - WARN [SendWorker:3:QuorumCnxManager$SendWorker@836] - Interrupted while waiting for message on queue
      java.lang.InterruptedException
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
      at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:986)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:824)
      2017-04-25 09:09:33,222 [myid:2] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@919] - Connection broken for id 3, my id = 2, error =
      java.net.SocketException: Socket closed
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:171)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at java.net.SocketInputStream.read(SocketInputStream.java:224)
      at java.io.DataInputStream.readInt(DataInputStream.java:387)
      at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:904)

      As far as I get that, node 2 wants to start a new leader election, but fails to establish a connection to the other nodes. It tries over and over, finally ending up in a timeout. Unfortunately, this doesn't give me any hint on what exactly breaks up the connection between the follower node (node 2) and the leader node (node 3) and why it can be re-established.

      It might also be relevant that I'm running ZooKeeper in Docker containers, using the host network option.

      Attachments

        1. logs-node3.txt
          188 kB
          Patrick Kleindienst
        2. logs-node2.txt
          221 kB
          Patrick Kleindienst
        3. logs-node1.txt
          213 kB
          Patrick Kleindienst

        Activity

          People

            Unassigned Unassigned
            pkleindienst Patrick Kleindienst
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: