Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.2.1, 3.3.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      ZOOKEEPER-508 includes the fix.

      Description

      here are the part of the log whereby my zookeeper instance crashed, taking 3 out of 5 down, and thus ruining the quorum for all clients:

      2009-07-23 12:29:06,769 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x52276d1d5161350 due to java.io.IOException: Read error
      2009-07-23 12:29:00,756 WARN org.apache.zookeeper.server.quorum.Follower: Exception when following the leader
      java.io.EOFException
      at java.io.DataInputStream.readInt(DataInputStream.java:375)
      at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
      at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:65)
      at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
      at org.apache.zookeeper.server.quorum.Follower.readPacket(Follower.java:114)
      at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:243)
      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:494)
      2009-07-23 12:29:06,770 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5161350 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.168:39489]
      2009-07-23 12:29:06,770 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb0578 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46797]
      2009-07-23 12:29:06,771 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa013e NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.153:33998]
      2009-07-23 12:29:06,771 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x52276d1d5160593 due to java.io.IOException: Read error
      2009-07-23 12:29:06,808 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e02bb NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.158:53758]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa13e4 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.154:58681]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691382 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59967]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb1354 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.163:49957]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa13cd NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.150:34212]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691383 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46813]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb0350 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59956]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e139b NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.156:55138]
      2009-07-23 12:29:06,809 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e1398 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.167:41257]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5161355 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.153:34032]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d516011c NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.155:56314]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e056b NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.155:56322]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d516011f NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.157:49618]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e11ea NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.10.20.42:55483]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e02ba NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.157:49632]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb1355 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.169:58824]
      2009-07-23 12:29:06,810 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691378 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.161:40973]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691380 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59944]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e0311 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.160:56167]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e690374 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.169:58815]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e139f NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.151:51396]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e139c NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.155:56315]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e69137b NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59859]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5160594 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.151:51370]
      2009-07-23 12:29:06,811 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e69137a NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46682]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5160347 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.165:35722]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e69137f NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46754]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5160121 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.155:56307]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb0126 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.154:58688]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa05fc NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.152:45067]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e0316 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.169:58800]
      2009-07-23 12:29:06,812 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e69137e NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46737]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e69137d NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.159:46733]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa13df NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.156:55137]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb134e NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.166:40443]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691381 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.161:41086]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5161356 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.165:35719]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb1349 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.
      20.20.158:53770]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x12276d15dfb0352 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.165:35718]
      2009-07-23 12:29:06,813 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691379 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59823]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d516000e NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.150:34216]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x32276d15d2e1397 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.169:58829]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e69137c NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59862]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa0140 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.155:56271]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x42276d1d3fa13e1 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.157:49608]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x22276d15e691377 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.162:59789]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x52276d1d5160593 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.165:35703]
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.FinalRequestProcessor: shutdown of request processor complete
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.quorum.FollowerRequestProcessor: FollowerRequestProcessor exited loop!
      2009-07-23 12:29:06,814 INFO org.apache.zookeeper.server.quorum.CommitProcessor: CommitProcessor exited loop!
      2009-07-23 12:29:06,815 INFO org.apache.zookeeper.server.quorum.Follower: shutdown called
      java.lang.Exception: shutdown Follower
      at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:427)
      at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)
      2009-07-23 12:29:06,815 WARN org.apache.zookeeper.server.NIOServerCnxn: Ignoring exception
      java.nio.channels.CancelledKeyException
      at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
      at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:69)
      at org.apache.zookeeper.server.NIOServerCnxn$Factory.run(NIOServerCnxn.java:201)
      2009-07-23 12:29:06,815 INFO org.apache.zookeeper.server.quorum.QuorumPeer: LOOKING
      2009-07-23 12:29:06,817 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
      2009-07-23 12:29:06,817 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.156:55206]
      2009-07-23 12:29:06,818 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
      2009-07-23 12:29:06,818 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.155:56331]
      [elided lots of the same]
      2009-07-23 12:29:33,008 INFO org.apache.zookeeper.server.NIOServerCnxn: closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/10.20.20.151:2181 remote=/10.20.20.152:5945
      8]
      2009-07-23 12:29:33,011 FATAL org.apache.zookeeper.server.SyncRequestProcessor: Severe unrecoverable error, exiting
      java.net.SocketException: Socket closed
      at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
      at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
      at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
      at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
      at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:131)
      at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:76)

      The good news is when I restarted the downed zookeepers, everything returned to normal.

      1. QuorumTest.log
        533 kB
        Mahadev konar
      2. QuorumTest.log.gz
        3.16 MB
        Mahadev konar
      3. zklogs.tar.gz
        4.44 MB
        ryan rawson
      4. ZOOKEEPER-483.patch
        9 kB
        Benjamin Reed
      5. ZOOKEEPER-483.patch
        4 kB
        Benjamin Reed
      6. ZOOKEEPER-483.patch
        4 kB
        Benjamin Reed
      7. ZOOKEEPER-483.patch
        4 kB
        Benjamin Reed

        Issue Links

          Activity

          Hide
          Mahadev konar added a comment -

          ryan,

          can you post (attach as files) server logs in different files for all you zookeeper servers? The one exception I can see if that you had problems with your disk but I am surprised to see it took down 2 of your other servers.

          also, please explain the setup – I am assuming all the servers are on different machines, and the logs and snapshots go to different disk?

          Show
          Mahadev konar added a comment - ryan, can you post (attach as files) server logs in different files for all you zookeeper servers? The one exception I can see if that you had problems with your disk but I am surprised to see it took down 2 of your other servers. also, please explain the setup – I am assuming all the servers are on different machines, and the logs and snapshots go to different disk?
          Hide
          Mahadev konar added a comment -

          I meant -

          The one exception I can see IS (instead of IF) that you had problems with your disk but I am surprised to see it took down 2 of your other servers.

          Show
          Mahadev konar added a comment - I meant - The one exception I can see IS (instead of IF) that you had problems with your disk but I am surprised to see it took down 2 of your other servers.
          Hide
          Patrick Hunt added a comment -

          Someone is looking at this but I realized that we don't have any docs which detail supervision of the
          ZooKeeper server jvms. In particular ZOOKEEPER-485 has been created to add this detail to the
          documentation. Typically (although perhaps not always) you want to have supervision of the zk server
          processes - if the JVM exists abnormally for any reason the supervisor should notify the
          operator and restart the ZK server process. We'll have to see what caused this issue, but in some
          cases where a ZK server sees an un-recoverable problem (say the JVM heap is exhausted, etc...)
          the jvm process will exit – the supervisor will need to restart the jvm.

          Show
          Patrick Hunt added a comment - Someone is looking at this but I realized that we don't have any docs which detail supervision of the ZooKeeper server jvms. In particular ZOOKEEPER-485 has been created to add this detail to the documentation. Typically (although perhaps not always) you want to have supervision of the zk server processes - if the JVM exists abnormally for any reason the supervisor should notify the operator and restart the ZK server process. We'll have to see what caused this issue, but in some cases where a ZK server sees an un-recoverable problem (say the JVM heap is exhausted, etc...) the jvm process will exit – the supervisor will need to restart the jvm.
          Hide
          ryan rawson added a comment -

          here is the complete log set from all 5 of our quorum members for the entire day. note that the event happens around 12:30 pm local (PDT) time in the log. I restarted the 3 crashed machines at about 13:30.

          They were not run under supervision, but clearly they should. The machines in question only have 1 disk, since they also run other hadoop things, and I was hoping that the quorum model would provide reliability even if the local disk isnt 100%. Although, having said that, the local disk seems to be ok, since nothing else on them has died.

          Show
          ryan rawson added a comment - here is the complete log set from all 5 of our quorum members for the entire day. note that the event happens around 12:30 pm local (PDT) time in the log. I restarted the 3 crashed machines at about 13:30. They were not run under supervision, but clearly they should. The machines in question only have 1 disk, since they also run other hadoop things, and I was hoping that the quorum model would provide reliability even if the local disk isnt 100%. Although, having said that, the local disk seems to be ok, since nothing else on them has died.
          Hide
          Mahadev konar added a comment -

          sorry I misread the exception , thanks pat for pointing it out

          • it looks like the followers could not send acks back to the leader and thus exited. If you have monitoring in place, you should check for any network problems that might have happend during that time. Usually we run our zookeeper servers as daemons that are restarted autamtically if they fail. So in this case the servers that failed would have automatically restarted.
          • also, http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperAdmin.html#sc_CrossMachineRequirements has some hints on setting up your environment (servers on different rack/switch etc.). you might want to read through it.
          Show
          Mahadev konar added a comment - sorry I misread the exception , thanks pat for pointing it out it looks like the followers could not send acks back to the leader and thus exited. If you have monitoring in place, you should check for any network problems that might have happend during that time. Usually we run our zookeeper servers as daemons that are restarted autamtically if they fail. So in this case the servers that failed would have automatically restarted. also, http://hadoop.apache.org/zookeeper/docs/r3.2.0/zookeeperAdmin.html#sc_CrossMachineRequirements has some hints on setting up your environment (servers on different rack/switch etc.). you might want to read through it.
          Hide
          Mahadev konar added a comment -

          marking it for 3.2.1 for now...

          Show
          Mahadev konar added a comment - marking it for 3.2.1 for now...
          Hide
          Mahadev konar added a comment -

          ryan,
          ben just noticed this. this is the exception from your trace -

          ava.net.SocketException: Socket closed
          at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
          at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
          at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
          at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
          at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
          at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:131)
          at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:76)
          
          

          We only put flush in SyncRequestProcessor in 3.2. Are you sure you are using 3.1.1?

          Show
          Mahadev konar added a comment - ryan, ben just noticed this. this is the exception from your trace - ava.net.SocketException: Socket closed at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100) at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:131) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:76) We only put flush in SyncRequestProcessor in 3.2. Are you sure you are using 3.1.1?
          Hide
          Patrick Hunt added a comment -

          Mahadev, we now include a bunch of environment information in the server log when it starts up, the following
          can be found in a few of the logs that Ryan attached:

          2009-07-23 13:27:39,363 INFO org.apache.zookeeper.server.ZooKeeperServer: Server environment:zookeeper.version=3.2.0--1, built on 05/15/2009 06:05 GMT

          Looks to me like it's not an official release, afaict something between 3.1 and 3.2 (built 5/15/2009), perhaps Ryan can shed more light
          on the specifics.

          Show
          Patrick Hunt added a comment - Mahadev, we now include a bunch of environment information in the server log when it starts up, the following can be found in a few of the logs that Ryan attached: 2009-07-23 13:27:39,363 INFO org.apache.zookeeper.server.ZooKeeperServer: Server environment:zookeeper.version=3.2.0--1, built on 05/15/2009 06:05 GMT Looks to me like it's not an official release, afaict something between 3.1 and 3.2 (built 5/15/2009), perhaps Ryan can shed more light on the specifics.
          Hide
          Patrick Hunt added a comment -

          This is interesting, from hbase-hadoop-zookeeper-sv2borg153.log

          2009-07-23 12:29:01,661 INFO org.apache.zookeeper.server.quorum.Follower: shutdown called
          java.lang.Exception: shutdown Follower
          at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:427)
          at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498)

          I don't see any further quorum messages until ~11 seconds later:

          2009-07-23 12:29:12,516 FATAL org.apache.zookeeper.server.SyncRequestProcessor: Severe unrecoverable error, exiting
          java.net.SocketException: Socket closed
          at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
          at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
          at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
          at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
          at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
          at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
          at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:131)
          at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:76)
          2009-07-23 12:29:12,559 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running

          SRP should be running/active if the QP is not following the leader?

          Show
          Patrick Hunt added a comment - This is interesting, from hbase-hadoop-zookeeper-sv2borg153.log 2009-07-23 12:29:01,661 INFO org.apache.zookeeper.server.quorum.Follower: shutdown called java.lang.Exception: shutdown Follower at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:427) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:498) I don't see any further quorum messages until ~11 seconds later: 2009-07-23 12:29:12,516 FATAL org.apache.zookeeper.server.SyncRequestProcessor: Severe unrecoverable error, exiting java.net.SocketException: Socket closed at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100) at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:131) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:76) 2009-07-23 12:29:12,559 WARN org.apache.zookeeper.server.NIOServerCnxn: Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running SRP should be running/active if the QP is not following the leader?
          Hide
          ryan rawson added a comment -

          In hbase we have a custom version, here it is:

          zookeeper-r785019-hbase-1329.jar

          based on that svn #, and plus a small patch from hbase-1329. This was opened as ZOOKEEPER-457.

          I think the problem might have happened when someone ran a heavy duty map-reduce with a major pile of disk IO intermediate output, which may have delayed zookeeper log writes. I'm looking to move the quorum to other machines.

          Show
          ryan rawson added a comment - In hbase we have a custom version, here it is: zookeeper-r785019-hbase-1329.jar based on that svn #, and plus a small patch from hbase-1329. This was opened as ZOOKEEPER-457 . I think the problem might have happened when someone ran a heavy duty map-reduce with a major pile of disk IO intermediate output, which may have delayed zookeeper log writes. I'm looking to move the quorum to other machines.
          Hide
          Benjamin Reed added a comment -

          i was able to reproduce the problem. and the patch was a missing catch for a socket exception.

          Show
          Benjamin Reed added a comment - i was able to reproduce the problem. and the patch was a missing catch for a socket exception.
          Hide
          Patrick Hunt added a comment -

          Cancelling patch – I see two issues with this patch that need to be addressed:

          1) in two places the following code was added:

          + } catch (IOException e1)

          { + }

          notice that the exception is not being logged. It should be logged as a warning

          2) the flush method is declared to throw IOException, but we are eating it. Perhaps
          it's ok in this situation? Regardless, having a comment here that states why
          it's ok to eat the exception would be useful.

          You sure we should be eating the exception here? rather than handling upstream?

          I'm esp worried given that this class sendackrequestprocessor doesn't own the socket, but
          we are closing the socket here (as part of the patch). Seems like a caution signal to me. Should
          this be handled by higher level code? Perhaps other RPs will have smilar issue (ioexception
          on flush that needs to be handled/cleanedup etc...)

          Show
          Patrick Hunt added a comment - Cancelling patch – I see two issues with this patch that need to be addressed: 1) in two places the following code was added: + } catch (IOException e1) { + } notice that the exception is not being logged. It should be logged as a warning 2) the flush method is declared to throw IOException, but we are eating it. Perhaps it's ok in this situation? Regardless, having a comment here that states why it's ok to eat the exception would be useful. You sure we should be eating the exception here? rather than handling upstream? I'm esp worried given that this class sendackrequestprocessor doesn't own the socket, but we are closing the socket here (as part of the patch). Seems like a caution signal to me. Should this be handled by higher level code? Perhaps other RPs will have smilar issue (ioexception on flush that needs to be handled/cleanedup etc...)
          Hide
          Benjamin Reed added a comment -

          I've addressed 1) in the attached patch.

          for 2) we are not eating the IOException. we are actually shutting things down. the bug is actually that we are passing it up to the upper layer, which does not know anything about the follower thread. we need to handle it here.

          Show
          Benjamin Reed added a comment - I've addressed 1) in the attached patch. for 2) we are not eating the IOException. we are actually shutting things down. the bug is actually that we are passing it up to the upper layer, which does not know anything about the follower thread. we need to handle it here.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12415695/ZOOKEEPER-483.patch
          against trunk revision 801839.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/176/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415695/ZOOKEEPER-483.patch against trunk revision 801839. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. -1 patch. The patch command could not apply the patch. Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/176/console This message is automatically generated.
          Hide
          Benjamin Reed added a comment -

          fixed patch to apply cleanly.

          Show
          Benjamin Reed added a comment - fixed patch to apply cleanly.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12415974/ZOOKEEPER-483.patch
          against trunk revision 802188.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/182/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/182/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/182/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415974/ZOOKEEPER-483.patch against trunk revision 802188. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/182/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/182/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/182/console This message is automatically generated.
          Hide
          Mahadev konar added a comment -

          the test added with the patch failed on hudson.. ben can you take a look?

          Show
          Mahadev konar added a comment - the test added with the patch failed on hudson.. ben can you take a look?
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12415974/ZOOKEEPER-483.patch
          against trunk revision 803000.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/185/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/185/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/185/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12415974/ZOOKEEPER-483.patch against trunk revision 803000. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/185/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/185/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/185/console This message is automatically generated.
          Hide
          Hudson added a comment -

          Integrated in ZooKeeper-trunk #412 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/412/)
          reverted: . ZK fataled on me, and ugly
          . ZK fataled on me, and ugly

          Show
          Hudson added a comment - Integrated in ZooKeeper-trunk #412 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/412/ ) reverted: . ZK fataled on me, and ugly . ZK fataled on me, and ugly
          Hide
          Mahadev konar added a comment -

          i ran ant test -Dtestcase=QuorumTest after applying the patch for 6 times and out of 6 it crashed with the same problems that hudson noticed. I will post the logs as soon - (it crashes the jvm so junit fails to get the logs)

          Show
          Mahadev konar added a comment - i ran ant test -Dtestcase=QuorumTest after applying the patch for 6 times and out of 6 it crashed with the same problems that hudson noticed. I will post the logs as soon - (it crashes the jvm so junit fails to get the logs)
          Hide
          Mahadev konar added a comment -

          i meant, it crashed 2 out of 6 times.

          Show
          Mahadev konar added a comment - i meant, it crashed 2 out of 6 times.
          Hide
          Mahadev konar added a comment -

          the logs of one of the QuoruMTest failures on my machine.

          Show
          Mahadev konar added a comment - the logs of one of the QuoruMTest failures on my machine.
          Hide
          Mahadev konar added a comment -

          this is the code that is causing the system.exit() in the logs that I uploaded

          File: FollowerZooKeeperServer.java - line 128

           if (pendingTxns.size() == 0) {
                      LOG.warn("Committing " + Long.toHexString(zxid)
                              + " without seeing txn");
                      return;
                  }
                  long firstElementZxid = pendingTxns.element().zxid;
                  if (firstElementZxid != zxid) {
                      LOG.fatal("Committing zxid 0x" + Long.toHexString(zxid)
                              + " but next pending txn 0x"
                              + Long.toHexString(firstElementZxid));
                      System.exit(12);
          
          
          Show
          Mahadev konar added a comment - this is the code that is causing the system.exit() in the logs that I uploaded File: FollowerZooKeeperServer.java - line 128 if (pendingTxns.size() == 0) { LOG.warn( "Committing " + Long .toHexString(zxid) + " without seeing txn" ); return ; } long firstElementZxid = pendingTxns.element().zxid; if (firstElementZxid != zxid) { LOG.fatal( "Committing zxid 0x" + Long .toHexString(zxid) + " but next pending txn 0x" + Long .toHexString(firstElementZxid)); System .exit(12);
          Hide
          Mahadev konar added a comment -

          attaching the log file with trace turned on...

          Show
          Mahadev konar added a comment - attaching the log file with trace turned on...
          Hide
          Benjamin Reed added a comment -

          The test case exposed another bug: log truncation was not being done properly with the buffered inputstream. i modified the test to make it fail reliably and then fixed the bug.

          Show
          Benjamin Reed added a comment - The test case exposed another bug: log truncation was not being done properly with the buffered inputstream. i modified the test to make it fail reliably and then fixed the bug.
          Hide
          Benjamin Reed added a comment -

          just to be clear. this bug isn't completely fixed and the test case should still be failing. i just want to make sure it fails reliably on the hudson machine.

          Show
          Benjamin Reed added a comment - just to be clear. this bug isn't completely fixed and the test case should still be failing. i just want to make sure it fails reliably on the hudson machine.
          Hide
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12416645/ZOOKEEPER-483.patch
          against trunk revision 803300.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/186/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/186/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/186/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12416645/ZOOKEEPER-483.patch against trunk revision 803300. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. -1 core tests. The patch failed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/186/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/186/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-vesta.apache.org/186/console This message is automatically generated.
          Hide
          Mahadev konar added a comment -

          ben,
          should me make the calss positioninputstream a seperate public class under persistence? It might be helpful for other use cases... just a thought....

          Show
          Mahadev konar added a comment - ben, should me make the calss positioninputstream a seperate public class under persistence? It might be helpful for other use cases... just a thought....
          Hide
          Mahadev konar added a comment -

          also, ben I think we should create a seperate jira for truncate and link it to this. We dont have to upload seperate patches, but just for accounting sakes.

          Show
          Mahadev konar added a comment - also, ben I think we should create a seperate jira for truncate and link it to this. We dont have to upload seperate patches, but just for accounting sakes.
          Hide
          Mahadev konar added a comment -

          included in ZOOKEEPER-508

          Show
          Mahadev konar added a comment - included in ZOOKEEPER-508
          Hide
          Mahadev konar added a comment -

          fix included in ZOOKEEPER-508.

          Show
          Mahadev konar added a comment - fix included in ZOOKEEPER-508 .

            People

            • Assignee:
              Benjamin Reed
              Reporter:
              ryan rawson
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development