Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.3.0
    • Component/s: quorum, server
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      I was doing some fault injection testing of 3.2.1 with ZOOKEEPER-508 patch applied and noticed that after some time the ensemble failed to re-elect a leader.

      See the attached log files - 5 member ensemble. typically 5 is the leader

      Notice that after 16:23:50,525 no quorum is formed, even after 20 minutes elapses w/no quorum

      environment:

      I was doing fault injection testing using aspectj. The faults are injected into socketchannel read/write, I throw exceptions randomly at a 1/200 ratio (rand.nextFloat() <= .005 => throw IOException

      You can see when a fault is injected in the log via:
      2009-08-19 16:57:09,568 - INFO [Thread-74:ReadRequestFailsIntermittently@38] - READPACKET FORCED FAIL

      vs a read/write that didn't force fail:
      2009-08-19 16:57:09,568 - INFO [Thread-74:ReadRequestFailsIntermittently@41] - READPACKET OK

      otw standard code/config (straight fle quorum with 5 members)

      also see the attached jstack trace. this is for one of the servers. Notice in particular that the number of sendworkers != the number of recv workers.

      1. ZOOKEEPER-512.patch
        3 kB
        Flavio Junqueira
      2. ZOOKEEPER-512.patch
        0.9 kB
        Flavio Junqueira
      3. ZOOKEEPER-512.patch
        2 kB
        Flavio Junqueira
      4. ZOOKEEPER-512.patch
        5 kB
        Flavio Junqueira
      5. ZOOKEEPER-512.patch
        5 kB
        Flavio Junqueira
      6. t5_aj.tar.gz
        1.42 MB
        Patrick Hunt
      7. logs2.tar.gz
        599 kB
        Patrick Hunt
      8. logs.tar.gz
        229 kB
        Patrick Hunt
      9. log3_debug.tar.gz
        820 kB
        Patrick Hunt
      10. jst.txt
        10 kB
        Patrick Hunt

        Activity

        Hide
        Flavio Junqueira added a comment -

        I'm not convinced this is a bug. Right now it sounds to me that the problem is with the way you're injecting faults. More concretely, it sounds like some threads are getting IOException, but the corresponding socket is not closing. As recv and sender come in pairs, if one dies and the other doesn't, we have a problem. At the same time, I believe the current code would eventually terminate a pair of workers send/recv if the socket closes. It is true, though, that the current code assumes that if RecvWorker catches an IOException when performing an socket operation, then the corresponding SendWorker will also catch an exception when trying to write to the socket. This is where I think your framework is broken, but please correct me if I'm missing anything.

        Show
        Flavio Junqueira added a comment - I'm not convinced this is a bug. Right now it sounds to me that the problem is with the way you're injecting faults. More concretely, it sounds like some threads are getting IOException, but the corresponding socket is not closing. As recv and sender come in pairs, if one dies and the other doesn't, we have a problem. At the same time, I believe the current code would eventually terminate a pair of workers send/recv if the socket closes. It is true, though, that the current code assumes that if RecvWorker catches an IOException when performing an socket operation, then the corresponding SendWorker will also catch an exception when trying to write to the socket. This is where I think your framework is broken, but please correct me if I'm missing anything.
        Hide
        Patrick Hunt added a comment -

        Your explanation sounds reasonable, but I don't see anything in the java socket

        {channel}

        apis that talk about this. perhaps I missed it. Do you have a pointer to something that talks about this? (I did some searches and couldn't find). Basically, why should we assume that any ioexception results in the socket being closed?

        Show
        Patrick Hunt added a comment - Your explanation sounds reasonable, but I don't see anything in the java socket {channel} apis that talk about this. perhaps I missed it. Do you have a pointer to something that talks about this? (I did some searches and couldn't find). Basically, why should we assume that any ioexception results in the socket being closed?
        Hide
        Patrick Hunt added a comment -

        take a look at logs2, this is similar fault injection model, however I'm now:

        sock.close()
        throw IOException

        rather than just throwing the ioexception. otw basically the same test as before.

        Notice that 1 drops off alot earlier than the rest (seems due to it's server id being the lowest?)

        Show
        Patrick Hunt added a comment - take a look at logs2, this is similar fault injection model, however I'm now: sock.close() throw IOException rather than just throwing the ioexception. otw basically the same test as before. Notice that 1 drops off alot earlier than the rest (seems due to it's server id being the lowest?)
        Hide
        Patrick Hunt added a comment -

        sorry, to be overly clear – the same problem occurs in this case (close/throw) – the quorum cannot be formed after some time.

        Show
        Patrick Hunt added a comment - sorry, to be overly clear – the same problem occurs in this case (close/throw) – the quorum cannot be formed after some time.
        Hide
        Flavio Junqueira added a comment -

        Two things:

        1- I'm not sure what you've been searching for, so I don't have a pointer, but the behavior I expect is that if you get an IOException upon invoking a socket operation, then the operation won't be available after that. Am I not interpreting it correctly?
        2- Visually expecting the logs, I was able to count about 20 successful leader elections. In the previous set of logs, I think servers got stuck around 5, so I see improvement after you modified your fault injection. Also, according to server 5, a leader was elected successfully. Here is the tail of the log of 5:

        2009-08-20 13:43:56,636 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:QuorumPeer@508] - LEADING
        2009-08-20 13:43:56,636 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:ZooKeeperServer@160] - Created server
        2009-08-20 13:43:56,643 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:FileSnap@81] - Reading snapshot ./localhost:2185/data/version-2/snapshot.1c0000001f
        2009-08-20 13:43:56,699 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2185:FileTxnSnapLog@208] - Snapshotting: 1c0000001f
        2009-08-20 13:43:56,844 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,845 - INFO  [FollowerHandler-/127.0.0.1:55253:FollowerHandler@227] - Follower sid: 4 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@1b1aa65
        2009-08-20 13:43:56,845 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,845 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,846 - WARN  [FollowerHandler-/127.0.0.1:55253:FollowerHandler@302] - Sending snapshot last zxid of peer is 0x1c0000001f  zxid of leader is 0x1d00000000
        2009-08-20 13:43:56,847 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,848 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,848 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,849 - INFO  [FollowerHandler-/127.0.0.1:55254:FollowerHandler@227] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@1ef9157
        2009-08-20 13:43:56,849 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,850 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,850 - WARN  [FollowerHandler-/127.0.0.1:55254:FollowerHandler@302] - Sending snapshot last zxid of peer is 0x1c0000001f  zxid of leader is 0x1d00000000
        2009-08-20 13:43:56,851 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:56,851 - WARN  [FollowerHandler-/127.0.0.1:55254:Leader@452] - Commiting zxid 0x1d00000000 from /127.0.0.1:3185 not first!
        2009-08-20 13:43:56,852 - WARN  [FollowerHandler-/127.0.0.1:55254:Leader@454] - First is 0
        2009-08-20 13:43:56,852 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:59,434 - INFO  [Thread-536:RequestFailsIntermittently@120] - SOCKET REQUEST OK
        2009-08-20 13:43:59,434 - INFO  [Thread-536:RequestFailsIntermittently@120] - SOCKET REQUEST OK
        2009-08-20 13:43:59,435 - INFO  [WorkerReceiver Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new notification.
        2009-08-20 13:43:59,435 - INFO  [Thread-535:RequestFailsIntermittently@120] - SOCKET REQUEST OK
        2009-08-20 13:43:59,846 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:43:59,848 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:00,765 - INFO  [NIOServerCxn.Factory:2185:NIOServerCnxn@698] - Processing stat command from /127.0.0.1:38350
        2009-08-20 13:44:00,766 - WARN  [NIOServerCxn.Factory:2185:NIOServerCnxn@494] - Exception causing close of session 0x0 due to java.io.IOException: Responded to info probe
        2009-08-20 13:44:00,766 - INFO  [NIOServerCxn.Factory:2185:NIOServerCnxn@833] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:2185 remote=/127.0.0.1:38350]
        2009-08-20 13:44:00,846 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:00,848 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:01,847 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:01,848 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:02,847 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:02,848 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:03,847 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:03,849 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:04,847 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:04,849 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:05,847 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:05,849 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:06,849 - INFO  [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK
        2009-08-20 13:44:06,849 - INFO  [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK
        

        Inspecting the logs a little more, it is true, though, that 1 and 3 are stuck. It could be due to the challenge protocol as you suggest, so we should keep it in mind.

        Show
        Flavio Junqueira added a comment - Two things: 1- I'm not sure what you've been searching for, so I don't have a pointer, but the behavior I expect is that if you get an IOException upon invoking a socket operation, then the operation won't be available after that. Am I not interpreting it correctly? 2- Visually expecting the logs, I was able to count about 20 successful leader elections. In the previous set of logs, I think servers got stuck around 5, so I see improvement after you modified your fault injection. Also, according to server 5, a leader was elected successfully. Here is the tail of the log of 5: 2009-08-20 13:43:56,636 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:QuorumPeer@508] - LEADING 2009-08-20 13:43:56,636 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:ZooKeeperServer@160] - Created server 2009-08-20 13:43:56,643 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:FileSnap@81] - Reading snapshot ./localhost:2185/data/version-2/snapshot.1c0000001f 2009-08-20 13:43:56,699 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2185:FileTxnSnapLog@208] - Snapshotting: 1c0000001f 2009-08-20 13:43:56,844 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,845 - INFO [FollowerHandler-/127.0.0.1:55253:FollowerHandler@227] - Follower sid: 4 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@1b1aa65 2009-08-20 13:43:56,845 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,845 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,846 - WARN [FollowerHandler-/127.0.0.1:55253:FollowerHandler@302] - Sending snapshot last zxid of peer is 0x1c0000001f zxid of leader is 0x1d00000000 2009-08-20 13:43:56,847 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,848 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,848 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,849 - INFO [FollowerHandler-/127.0.0.1:55254:FollowerHandler@227] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@1ef9157 2009-08-20 13:43:56,849 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,850 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,850 - WARN [FollowerHandler-/127.0.0.1:55254:FollowerHandler@302] - Sending snapshot last zxid of peer is 0x1c0000001f zxid of leader is 0x1d00000000 2009-08-20 13:43:56,851 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:56,851 - WARN [FollowerHandler-/127.0.0.1:55254:Leader@452] - Commiting zxid 0x1d00000000 from /127.0.0.1:3185 not first! 2009-08-20 13:43:56,852 - WARN [FollowerHandler-/127.0.0.1:55254:Leader@454] - First is 0 2009-08-20 13:43:56,852 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:59,434 - INFO [Thread-536:RequestFailsIntermittently@120] - SOCKET REQUEST OK 2009-08-20 13:43:59,434 - INFO [Thread-536:RequestFailsIntermittently@120] - SOCKET REQUEST OK 2009-08-20 13:43:59,435 - INFO [WorkerReceiver Thread:FastLeaderElection$Messenger$WorkerReceiver@254] - Sending new notification. 2009-08-20 13:43:59,435 - INFO [Thread-535:RequestFailsIntermittently@120] - SOCKET REQUEST OK 2009-08-20 13:43:59,846 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:43:59,848 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:00,765 - INFO [NIOServerCxn.Factory:2185:NIOServerCnxn@698] - Processing stat command from /127.0.0.1:38350 2009-08-20 13:44:00,766 - WARN [NIOServerCxn.Factory:2185:NIOServerCnxn@494] - Exception causing close of session 0x0 due to java.io.IOException: Responded to info probe 2009-08-20 13:44:00,766 - INFO [NIOServerCxn.Factory:2185:NIOServerCnxn@833] - closing session:0x0 NIOServerCnxn: java.nio.channels.SocketChannel[connected local=/127.0.0.1:2185 remote=/127.0.0.1:38350] 2009-08-20 13:44:00,846 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:00,848 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:01,847 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:01,848 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:02,847 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:02,848 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:03,847 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:03,849 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:04,847 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:04,849 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:05,847 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:05,849 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:06,849 - INFO [FollowerHandler-/127.0.0.1:55253:RequestFailsIntermittently@91] - RECORD REQUEST OK 2009-08-20 13:44:06,849 - INFO [FollowerHandler-/127.0.0.1:55254:RequestFailsIntermittently@91] - RECORD REQUEST OK Inspecting the logs a little more, it is true, though, that 1 and 3 are stuck. It could be due to the challenge protocol as you suggest, so we should keep it in mind.
        Hide
        Flavio Junqueira added a comment -

        Pat, could you run it again and switch log debug on for QuorumCnxManager, please?

        Show
        Flavio Junqueira added a comment - Pat, could you run it again and switch log debug on for QuorumCnxManager, please?
        Hide
        Flavio Junqueira added a comment -

        I just realized that there is a bug in the first comment I posted today. I wanted to say that: if you get an IOException upon invoking a socket operation, then the SOCKET won't be available after that. (I really miss the ability to edit comments.)

        Show
        Flavio Junqueira added a comment - I just realized that there is a bug in the first comment I posted today. I wanted to say that: if you get an IOException upon invoking a socket operation, then the SOCKET won't be available after that. (I really miss the ability to edit comments.)
        Hide
        Patrick Hunt added a comment -

        I've been reading the Java API spec, for example:
        http://java.sun.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#read%28java.nio.ByteBuffer%29

        there's nothing here (nor in Socket docs) that I can find that says that an ioexception thrown by the read method
        results in what you say you are expecting. Unless you can find otw I don't think it's prudent to assume a particular
        behavior.

        The quorum was def not formed when I took the log snapshot, there was no active leader.. Clients were not able to
        connect to any server in the cluster, and running "stat" on the command port resulted in "zookeeper server not running" being
        returned by all 5 servers. (not the typical "... mode:follower...." etc... stat result.)

        I'll re-run and attach with debug logs.

        Show
        Patrick Hunt added a comment - I've been reading the Java API spec, for example: http://java.sun.com/javase/6/docs/api/java/nio/channels/SocketChannel.html#read%28java.nio.ByteBuffer%29 there's nothing here (nor in Socket docs) that I can find that says that an ioexception thrown by the read method results in what you say you are expecting. Unless you can find otw I don't think it's prudent to assume a particular behavior. The quorum was def not formed when I took the log snapshot, there was no active leader.. Clients were not able to connect to any server in the cluster, and running "stat" on the command port resulted in "zookeeper server not running" being returned by all 5 servers. (not the typical "... mode:follower...." etc... stat result.) I'll re-run and attach with debug logs.
        Hide
        Patrick Hunt added a comment -

        debug logs for similar situation as previous.

        Show
        Patrick Hunt added a comment - debug logs for similar situation as previous.
        Hide
        Patrick Hunt added a comment -

        sample test environment. start.sh is setup to cause failures at .5% probability (each net operation)

        Show
        Patrick Hunt added a comment - sample test environment. start.sh is setup to cause failures at .5% probability (each net operation)
        Hide
        Flavio Junqueira added a comment -

        Based on feedback I got on how to handle socket exceptions, I have added a finally block that closes the channel in RecvWorker. I haven't been able to reproduce the problem reliably yet, but I can't see any issue after applying this patch. It would be good if Pat could run it to see if the problem still manifests.

        Show
        Flavio Junqueira added a comment - Based on feedback I got on how to handle socket exceptions, I have added a finally block that closes the channel in RecvWorker. I haven't been able to reproduce the problem reliably yet, but I can't see any issue after applying this patch. It would be good if Pat could run it to see if the problem still manifests.
        Hide
        Patrick Hunt added a comment -

        I don't see any change in behavior, still see similar issues as before.

        Also the patch fails to compile - close is declared to throw ioexception, a checked exception. I had to
        wrap with a try/catch/log.warn.

        Show
        Patrick Hunt added a comment - I don't see any change in behavior, still see similar issues as before. Also the patch fails to compile - close is declared to throw ioexception, a checked exception. I had to wrap with a try/catch/log.warn.
        Hide
        Mahadev konar added a comment -

        can we move this out to 3.3? I dont think its a regression or is it?

        Show
        Mahadev konar added a comment - can we move this out to 3.3? I dont think its a regression or is it?
        Hide
        Flavio Junqueira added a comment -

        I have finally been able to reproduce it reliably, and it is true, the ensemble stalls after a while. Looking at the logs, I realized the same as Pat: not enough votes are coming through. Soon later, however, I also realized that most processes died, and here is the cause:

        2009-08-25 10:51:04,617 - FATAL [SyncThread:2:SyncRequestProcessor@131] - Severe unrecoverable error, exiting
        java.net.SocketException: Socket closed
                at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99)
                at java.net.SocketOutputStream.write(SocketOutputStream.java:136)
                at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
                at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
                at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100)
                at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52)
                at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:147)
                at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:92)
        

        After three zookeeper processes die this way, we can't obviously form a quorum. My conclusion is that the aspects are killing the processes, and leader election cannot succeed without a quorum.

        I think we should still add the finally block as in the broken patch I uploaded before. It makes sense to have it, but it is probably ok if we postpone it to 3.3.

        Show
        Flavio Junqueira added a comment - I have finally been able to reproduce it reliably, and it is true, the ensemble stalls after a while. Looking at the logs, I realized the same as Pat: not enough votes are coming through. Soon later, however, I also realized that most processes died, and here is the cause: 2009-08-25 10:51:04,617 - FATAL [SyncThread:2:SyncRequestProcessor@131] - Severe unrecoverable error, exiting java.net.SocketException: Socket closed at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:99) at java.net.SocketOutputStream.write(SocketOutputStream.java:136) at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65) at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123) at org.apache.zookeeper.server.quorum.Follower.writePacket(Follower.java:100) at org.apache.zookeeper.server.quorum.SendAckRequestProcessor.flush(SendAckRequestProcessor.java:52) at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:147) at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:92) After three zookeeper processes die this way, we can't obviously form a quorum. My conclusion is that the aspects are killing the processes, and leader election cannot succeed without a quorum. I think we should still add the finally block as in the broken patch I uploaded before. It makes sense to have it, but it is probably ok if we postpone it to 3.3.
        Hide
        Flavio Junqueira added a comment -

        I realized that the logs that Pat posted do not contain the same messages indicating that the processes exited. My understanding is that the finally block had some effect and now the test is exposing a potential new issue.

        Just in case, I'm uploading a simple patch for testing. It compiles fine for me.

        Show
        Flavio Junqueira added a comment - I realized that the logs that Pat posted do not contain the same messages indicating that the processes exited. My understanding is that the finally block had some effect and now the test is exposing a potential new issue. Just in case, I'm uploading a simple patch for testing. It compiles fine for me.
        Hide
        Flavio Junqueira added a comment -

        New patch fixes two issues:

        1. Adds finally block to method run of RecvWorker;
        2. Replaces take() with poll() in SendWorker(). This fixes a race condition that can easily manifest when the system goes haywire (see Pat's aspects for examples).

        This patch should apply to both 3.2 and trunk.

        Show
        Flavio Junqueira added a comment - New patch fixes two issues: Adds finally block to method run of RecvWorker; Replaces take() with poll() in SendWorker(). This fixes a race condition that can easily manifest when the system goes haywire (see Pat's aspects for examples). This patch should apply to both 3.2 and trunk.
        Hide
        Patrick Hunt added a comment -

        I'm afraid that with this latest patch I'm still seeing similar to what I was seeing previously.

        when I drive the cluster hard (in this case 6 clients, each client connecting to each of the 5 servers
        each session creating/getting/deleting a particular node inside a loop that runs every second (sleeps for 1 sec at end of loop)

        Show
        Patrick Hunt added a comment - I'm afraid that with this latest patch I'm still seeing similar to what I was seeing previously. when I drive the cluster hard (in this case 6 clients, each client connecting to each of the 5 servers each session creating/getting/deleting a particular node inside a loop that runs every second (sleeps for 1 sec at end of loop)
        Hide
        Flavio Junqueira added a comment -

        I've found a corner that I was not expecting but was an easy fix. Basically receiveConnection was passing an invalid server identifier to connectOne which was throwing a NPE and was propagating to the Listener. The listener would consequently die, and would stop participating in leader election. The fix is just not to proceed with the connection when that happens.

        This patch is working fine for my test. It is important to note, though, that if we are overwhelming servers so much (clients are hammering the system and connections are failing), then there will be periods in which there will be no leader. The important invariant to satisfy is that the system converges to a live state once it stabilizes. In my tests, I observe periods with no leader when clients are hammering the servers with requests, but they converge to a leader soon after the clients stop. Of course, if we have no injected faults, the clients requests are executed just fine (there is always a leader). This is the behavior I expect to see.

        At the same time, although I think it was a good idea to test such an extreme case, I'm still not convinced that this test is realistic. It would be great if we could model the cases this fault injection is trying to emulate to make sure they are really expected cases.

        Also, I don't see a good way of introducing a unit test for such extreme cases. In fact, I'm not even sure it would make sense to test only leader election under such extreme conditions.

        Show
        Flavio Junqueira added a comment - I've found a corner that I was not expecting but was an easy fix. Basically receiveConnection was passing an invalid server identifier to connectOne which was throwing a NPE and was propagating to the Listener. The listener would consequently die, and would stop participating in leader election. The fix is just not to proceed with the connection when that happens. This patch is working fine for my test. It is important to note, though, that if we are overwhelming servers so much (clients are hammering the system and connections are failing), then there will be periods in which there will be no leader. The important invariant to satisfy is that the system converges to a live state once it stabilizes. In my tests, I observe periods with no leader when clients are hammering the servers with requests, but they converge to a leader soon after the clients stop. Of course, if we have no injected faults, the clients requests are executed just fine (there is always a leader). This is the behavior I expect to see. At the same time, although I think it was a good idea to test such an extreme case, I'm still not convinced that this test is realistic. It would be great if we could model the cases this fault injection is trying to emulate to make sure they are really expected cases. Also, I don't see a good way of introducing a unit test for such extreme cases. In fact, I'm not even sure it would make sense to test only leader election under such extreme conditions.
        Hide
        Benjamin Reed added a comment -

        agreed. i think the problem is that under high load we don't have a period of error free operation. i think it is ok to generate errors randomly as we are doing, but we should have periods of error free operation so that things can settle down.

        Show
        Benjamin Reed added a comment - agreed. i think the problem is that under high load we don't have a period of error free operation. i think it is ok to generate errors randomly as we are doing, but we should have periods of error free operation so that things can settle down.
        Hide
        Patrick Hunt added a comment -

        I'm seeing 2 cases:

        1) the entire quorum is unstable because clients are driving and causing many network (simulated) failures, in this case I agree

        2) but I also see the case where the quorum is stable, but there's one server that's
        been orphaned from the group. it is never able to reconnect, even though the clients
        are stopped and the quorum in general is stable.

        eventually 3 servers become orphaned (out of 5), in which case regardless of clients are running
        or not the quorum will never re-form. I don't agree that this is "ok".

        Show
        Patrick Hunt added a comment - I'm seeing 2 cases: 1) the entire quorum is unstable because clients are driving and causing many network (simulated) failures, in this case I agree 2) but I also see the case where the quorum is stable, but there's one server that's been orphaned from the group. it is never able to reconnect, even though the clients are stopped and the quorum in general is stable. eventually 3 servers become orphaned (out of 5), in which case regardless of clients are running or not the quorum will never re-form. I don't agree that this is "ok".
        Hide
        Flavio Junqueira added a comment -

        Pat, I didn't understand from your last comment if you have tried the patch I uploaded yesterday. If you did and it still doesn't work for you, I would appreciate if you could upload logs and jstack traces when you have a chance.

        Show
        Flavio Junqueira added a comment - Pat, I didn't understand from your last comment if you have tried the patch I uploaded yesterday. If you did and it still doesn't work for you, I would appreciate if you could upload logs and jstack traces when you have a chance.
        Hide
        Patrick Hunt added a comment -

        I tried your latest patch with the latest trunk code and I'm not able to reproduce the problem. Looks like this
        is addressing the problem.

        Show
        Patrick Hunt added a comment - I tried your latest patch with the latest trunk code and I'm not able to reproduce the problem. Looks like this is addressing the problem.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12417726/ZOOKEEPER-512.patch
        against trunk revision 823371.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12417726/ZOOKEEPER-512.patch against trunk revision 823371. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/20/console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        Pat, does it make sense to add your fault injection framework as a test for this patch?

        Show
        Flavio Junqueira added a comment - Pat, does it make sense to add your fault injection framework as a test for this patch?
        Hide
        Patrick Hunt added a comment -

        I don't think we are ready for that. It's pretty straightforward to run manually, however running automatically
        it's difficult to determine success/failure for example. I sure we can do something but
        it will take some effort.

        Show
        Patrick Hunt added a comment - I don't think we are ready for that. It's pretty straightforward to run manually, however running automatically it's difficult to determine success/failure for example. I sure we can do something but it will take some effort.
        Hide
        Mahadev konar added a comment -

        flavio, the patch looks good -

        The following logging can be imprvoed to include which quorum server it corresponds to (for unit testing) and in general.

                    LOG.info("Leaving listener");
                    if(!shutdown)
                        LOG.fatal("As I'm leaving the listener thread, I won't be able to participate in leader election any longer... digital life sucks");
        

        Also, I can see the hatred for digital life , but a more useful logging message would be better !

        • also I am having troble understanding this -
        synchronized void connectOne(long sid){
         if (senderWorkerMap.get(sid) == null){
                    InetSocketAddress electionAddr;
                    if(self.quorumPeers.containsKey(sid))
                        electionAddr =
                            self.quorumPeers.get(sid).electionAddr;
                    else{
                        LOG.warn("Invalid server id: " + sid);
                        return;
                    }
        

        you mentioned above that connectOne was being called with a sid that wasnt in the map. Is that possible?

        Show
        Mahadev konar added a comment - flavio, the patch looks good - The following logging can be imprvoed to include which quorum server it corresponds to (for unit testing) and in general. LOG.info( "Leaving listener" ); if (!shutdown) LOG.fatal( "As I'm leaving the listener thread, I won't be able to participate in leader election any longer... digital life sucks" ); Also, I can see the hatred for digital life , but a more useful logging message would be better ! also I am having troble understanding this - synchronized void connectOne( long sid){ if (senderWorkerMap.get(sid) == null ){ InetSocketAddress electionAddr; if (self.quorumPeers.containsKey(sid)) electionAddr = self.quorumPeers.get(sid).electionAddr; else { LOG.warn( "Invalid server id: " + sid); return ; } you mentioned above that connectOne was being called with a sid that wasnt in the map. Is that possible?
        Hide
        Flavio Junqueira added a comment -

        Bad taste humor aside, I think the message is pretty valid: the listener thread is dying and the server won't be able to participate in leader election any longer. If you want, I can definitely remove the last part.

        I'm not sure either how, but all those faults were causing some calls to connectOne with an invalid sid, and the calls originated from receiveConnection. The trunk code assumes that this id is always valid, and it causes some trouble if the sid is not valid.

        Show
        Flavio Junqueira added a comment - Bad taste humor aside, I think the message is pretty valid: the listener thread is dying and the server won't be able to participate in leader election any longer. If you want, I can definitely remove the last part. I'm not sure either how, but all those faults were causing some calls to connectOne with an invalid sid, and the calls originated from receiveConnection. The trunk code assumes that this id is always valid, and it causes some trouble if the sid is not valid.
        Hide
        Mahadev konar added a comment -

        my idea was to atleast add the server id/ipaddress for better logging and some other information if available!

        Show
        Mahadev konar added a comment - my idea was to atleast add the server id/ipaddress for better logging and some other information if available!
        Hide
        Mahadev konar added a comment -

        cancelling patch for flavio to update the logging.

        Show
        Mahadev konar added a comment - cancelling patch for flavio to update the logging.
        Hide
        Flavio Junqueira added a comment -

        Fixing log message and uploading new patch.

        Show
        Flavio Junqueira added a comment - Fixing log message and uploading new patch.
        Hide
        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12422891/ZOOKEEPER-512.patch
        against trunk revision 828216.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/console

        This message is automatically generated.

        Show
        Hadoop QA added a comment - -1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12422891/ZOOKEEPER-512.patch against trunk revision 828216. +1 @author. The patch does not contain any @author tags. -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h8.grid.sp2.yahoo.net/36/console This message is automatically generated.
        Hide
        Flavio Junqueira added a comment -

        We have been testing this patch externally with Pat's fault injection framework that uses aspectj. It is difficult at this point to introduce his framework, so we have agreed to postpone adding such tests. The patch fixes some visible problems and passes previous tests.

        Show
        Flavio Junqueira added a comment - We have been testing this patch externally with Pat's fault injection framework that uses aspectj. It is difficult at this point to introduce his framework, so we have agreed to postpone adding such tests. The patch fixes some visible problems and passes previous tests.
        Hide
        Mahadev konar added a comment -

        flavio, the patch looks good. Just trying to understand why the following code:

        +            while((!shutdown) && (numRetries < 3)){
        +                try {
        +                    ss = ServerSocketChannel.open();
        +                    int port = self.quorumPeers.get(self.getId()).electionAddr.getPort();
        +                    LOG.info("My election bind port: " + port);
        +                    ss.socket().setReuseAddress(true); 
        +                    ss.socket().bind(new InetSocketAddress(port));
        

        has been moved into the while loop? Why do we need to bind within a while loop?

        Show
        Mahadev konar added a comment - flavio, the patch looks good. Just trying to understand why the following code: + while ((!shutdown) && (numRetries < 3)){ + try { + ss = ServerSocketChannel.open(); + int port = self.quorumPeers.get(self.getId()).electionAddr.getPort(); + LOG.info( "My election bind port: " + port); + ss.socket().setReuseAddress( true ); + ss.socket().bind( new InetSocketAddress(port)); has been moved into the while loop? Why do we need to bind within a while loop?
        Hide
        Flavio Junqueira added a comment -

        This is for retrying. If there is a problem while listening or trying to bind to the socket, it tries again and gives up after 3 consecutive attempts.

        Show
        Flavio Junqueira added a comment - This is for retrying. If there is a problem while listening or trying to bind to the socket, it tries again and gives up after 3 consecutive attempts.
        Hide
        Mahadev konar added a comment -

        +1 the patch looks good.

        Show
        Mahadev konar added a comment - +1 the patch looks good.
        Hide
        Mahadev konar added a comment -

        I just committed this. thanks flavio.

        Show
        Mahadev konar added a comment - I just committed this. thanks flavio.
        Hide
        Hudson added a comment -

        Integrated in ZooKeeper-trunk #511 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/511/)
        . FLE election fails to elect leader (flavio via mahadev)

        Show
        Hudson added a comment - Integrated in ZooKeeper-trunk #511 (See http://hudson.zones.apache.org/hudson/job/ZooKeeper-trunk/511/ ) . FLE election fails to elect leader (flavio via mahadev)

          People

          • Assignee:
            Flavio Junqueira
            Reporter:
            Patrick Hunt
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development