Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Critical Critical
    • Resolution: Not A Problem
    • Affects Version/s: 3.3.0, 3.3.3
    • Fix Version/s: None
    • Component/s: quorum
    • Labels:
      None
    • Environment:

      linux rhel 4, x64, java version 1.4.2

      Description

      exception causing shutdownthere are 5 members in the quorum. one follower can not sync with leader after restart. it seems leader has close the data connection with this follower because of read timeout. here is the key log in follower:

      2011-06-30 22:14:45,069 - WARN  [Thread-17:QuorumCnxManager$RecvWorker@658] - Connection broken: 
      java.nio.channels.ClosedChannelException
              at sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:113)
              at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:156)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:629)
      2011-06-30 22:14:45,069 - INFO  [QuorumPeer:/0.0.0.0:2181:FastLeaderElection@689] - Notification: 3, 17198470148, 3, 3, LOOKING, LOOKING, 3
      2011-06-30 22:14:45,070 - ERROR [Thread-16:QuorumCnxManager$SendWorker@559] - Failed to send last message. Shutting down thread.
      java.nio.channels.ClosedChannelException
              at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
              at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.send(QuorumCnxManager.java:548)
              at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:557)
      2011-06-30 22:14:45,082 - INFO  [QuorumPeer:/0.0.0.0:2181:Learner@282] - Getting a diff from the leader 0x4011bd462
      2011-06-30 22:14:45,083 - WARN  [Thread-18:QuorumCnxManager$SendWorker@589] - Send worker leaving thread
      2011-06-30 22:14:45,085 - WARN  [QuorumPeer:/0.0.0.0:2181:Follower@116] - Got zxid 0x4011bd405 expected 0x1
      2011-06-30 22:14:45,090 - INFO  [QuorumPeer:/0.0.0.0:2181:FileTxnSnapLog@208] - Snapshotting: 4011bd462
      2011-06-30 22:14:53,397 - WARN  [SyncThread:3:SendAckRequestProcessor@63] - Closing connection to leader, exception during packet send
      java.net.SocketException: Broken pipe
              at java.net.SocketOutputStream.socketWrite0(Native Method)
              at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:92)
              at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
              at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
              at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
              at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:126)
              at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:61)
              at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:164)
              at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98)
      2011-06-30 22:14:53,398 - WARN  [QuorumPeer:/0.0.0.0:2181:Follower@82] - Exception when following the leader
      java.net.SocketException: Socket closed
              at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
              at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
              at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
              at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
              at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:126)
              at org.apache.zookeeper.server.quorum.Learner.ping(Learner.java:358)
              at org.apache.zookeeper.server.quorum.Follower.processPacket(Follower.java:108)
              at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:79)
              at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:634)
      2011-06-30 22:14:53,398 - WARN  [SyncThread:3:SendAckRequestProcessor@63] - Closing connection to leader, exception during packet send
      java.net.SocketException: Socket closed
              at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
              at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
              at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
              at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
              at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:126)
              at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:61)
              at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:164)
              at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98)
      2011-06-30 22:14:53,399 - INFO  [QuorumPeer:/0.0.0.0:2181:Follower@166] - shutdown called
      java.lang.Exception: shutdown Follower
              at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
              at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:638)
      and these are the leader's:
      2011-06-30 22:14:35,943 - ERROR [LearnerHandler-/10.23.247.163:14975:LearnerHandler@444] - Unexpected exception causing shutdown while sock still open
      java.net.SocketTimeoutException: Read timed out
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.read(SocketInputStream.java:129)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
              at java.io.DataInputStream.readInt(DataInputStream.java:370)
              at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
              at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
              at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
              at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358)
      2011-06-30 22:14:35,943 - WARN  [LearnerHandler-/10.23.247.163:14975:LearnerHandler@457] - ******* GOODBYE /10.23.247.163:14975 ********
      2011-06-30 22:14:48,943 - ERROR [CommitProcessor:4:NIOServerCnxn@422] - Unexpected Exception: 
      java.nio.channels.CancelledKeyException
              at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
              at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
              at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:395)
              at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1360)
              at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:367)
              at org.apache.zookeeper.server.quorum.Leader$ToBeAppliedRequestProcessor.processRequest(Leader.java:535)
              at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
      2011-06-30 22:14:49,084 - ERROR [LearnerHandler-/10.23.247.163:14998:LearnerHandler@444] - Unexpected exception causing shutdown while sock still open
      java.net.SocketTimeoutException: Read timed out
              at java.net.SocketInputStream.socketRead0(Native Method)
              at java.net.SocketInputStream.read(SocketInputStream.java:129)
              at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
              at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
              at java.io.DataInputStream.readInt(DataInputStream.java:370)
              at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
              at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
              at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
              at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:358)
      2011-06-30 22:14:49,084 - WARN  [LearnerHandler-/10.23.247.163:14998:LearnerHandler@457] - ******* GOODBYE /10.23.247.163:14998 ********
      

        Issue Links

          Activity

          Hide
          helei added a comment -

          Hi, guys. After follower get synced with leader, we will call zk.takeSnapshot in communication thread in Learner.java. And this is a function may block. In my case, it takes about 4 seconds to finish the snap. During this period, follower would not ack leader, and leader would thought that something happend to this follower, and close the connection with this follower. After that the follower will receive a broken pipe exception, and it will retry after that. Of cause it will fail time after time.
          I think it is a problem.
          thanks

          Show
          helei added a comment - Hi, guys. After follower get synced with leader, we will call zk.takeSnapshot in communication thread in Learner.java. And this is a function may block. In my case, it takes about 4 seconds to finish the snap. During this period, follower would not ack leader, and leader would thought that something happend to this follower, and close the connection with this follower. After that the follower will receive a broken pipe exception, and it will retry after that. Of cause it will fail time after time. I think it is a problem. thanks
          Hide
          Flavio Junqueira added a comment -

          Hi Helei, Have you tried increasing the value of initLimit? Check options here:

          http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html

          Show
          Flavio Junqueira added a comment - Hi Helei, Have you tried increasing the value of initLimit? Check options here: http://zookeeper.apache.org/doc/r3.3.3/zookeeperAdmin.html
          Hide
          helei added a comment -

          Hi Flavio, this does not help. When the follower took a long time to do a snap, the leader would timeout and then close the connection. I think the problem is here:
          Leader.java: 237 s.setSoTimeout(self.tickTime * self.syncLimit);
          After this time, Leader would close the connection. And follower would receive a broken pipe error.

          Thanks

          Show
          helei added a comment - Hi Flavio, this does not help. When the follower took a long time to do a snap, the leader would timeout and then close the connection. I think the problem is here: Leader.java: 237 s.setSoTimeout(self.tickTime * self.syncLimit); After this time, Leader would close the connection. And follower would receive a broken pipe error. Thanks
          Hide
          Flavio Junqueira added a comment -

          What about increasing syncLimit?

          Show
          Flavio Junqueira added a comment - What about increasing syncLimit?
          Hide
          helei added a comment -

          Hmm, of cause it works. But I really don't think that's the right way... It obviously harm the performance. We have to limit the node number, or the leader will be the single point. But that's OK, it surely works this time. Thanks again.

          Show
          helei added a comment - Hmm, of cause it works. But I really don't think that's the right way... It obviously harm the performance. We have to limit the node number, or the leader will be the single point. But that's OK, it surely works this time. Thanks again.
          Hide
          Patrick Hunt added a comment -

          Is this addressed in ZOOKEEPER-1521 ? i.e. you can set the initlimit higher to allow for longer initialization while not having to modify synclimit.

          Show
          Patrick Hunt added a comment - Is this addressed in ZOOKEEPER-1521 ? i.e. you can set the initlimit higher to allow for longer initialization while not having to modify synclimit.
          Hide
          Patrick Hunt added a comment -

          Flavio Junqueira thoughts? Should we close this?

          Show
          Patrick Hunt added a comment - Flavio Junqueira thoughts? Should we close this?
          Hide
          Flavio Junqueira added a comment -

          I'm resolving this one as not a problem, given the discussion in the jira, and please feel free to reopen if needed. Also, note that this jira was created for the 3.3 branch, and we've been recommending that users move to the 3.4 branch.

          Show
          Flavio Junqueira added a comment - I'm resolving this one as not a problem, given the discussion in the jira, and please feel free to reopen if needed. Also, note that this jira was created for the 3.3 branch, and we've been recommending that users move to the 3.4 branch.

            People

            • Assignee:
              Unassigned
              Reporter:
              helei
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development