ZooKeeper
  1. ZooKeeper
  2. ZOOKEEPER-1006

QuorumPeer "Address already in use" -- regression in 3.3.3

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 3.3.3
    • Fix Version/s: 3.3.4, 3.4.0
    • Component/s: tests
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      turns out this is a bug in the test, the supplied patch fixes the problem by using polling rather than straight sleep.

      Description

      CnxManagerTest.testWorkerThreads

      See attachment, this is the first time I've seen this test fail, and it's failed 2 out of the last three test runs.

      Notice (attachment) once this happens the port never becomes available.

      2011-03-02 15:53:12,425 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11245:NIOServerCnxn$Factory@251] - Accepted socket connection from /172.29.6.162:51441
      2011-03-02 15:53:12,430 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11245:NIOServerCnxn@639] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
      2011-03-02 15:53:12,430 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:11245:NIOServerCnxn@1435] - Closed socket connection for client /172.29.6.162:51441 (no session established for client)
      2011-03-02 15:53:12,430 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:11241:Follower@82] - Exception when following the leader
      java.io.EOFException
      	at java.io.DataInputStream.readInt(DataInputStream.java:375)
      	at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
      	at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
      	at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
      	at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148)
      	at org.apache.zookeeper.server.quorum.Learner.registerWithLeader(Learner.java:267)
      	at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:66)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
      2011-03-02 15:53:12,431 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11241:Follower@165] - shutdown called
      java.lang.Exception: shutdown Follower
      	at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649)
      2011-03-02 15:53:12,432 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11241:QuorumPeer@621] - LOOKING
      2011-03-02 15:53:12,432 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11241:FastLeaderElection@663] - New election. My id =  0, Proposed zxid = 0
      2011-03-02 15:53:12,433 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
      2011-03-02 15:53:12,433 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
      2011-03-02 15:53:12,433 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
      2011-03-02 15:53:12,633 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
      2011-03-02 15:53:12,633 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11245:QuorumPeer@655] - LEADING
      2011-03-02 15:53:12,636 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11245:Leader@54] - TCP NoDelay set to: true
      2011-03-02 15:53:12,638 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11245:ZooKeeperServer@151] - Created server with tickTime 1000 minSessionTimeout 2000 maxSessionTimeout 20000 datadir /var/lib/hudson/workspace/CDH3-ZooKeeper-3.3.3_sles/build/test/tmp/test9001250572426375869.junit.dir/version-2 snapdir /var/lib/hudson/workspace/CDH3-ZooKeeper-3.3.3_sles/build/test/tmp/test9001250572426375869.junit.dir/version-2
      2011-03-02 15:53:12,639 - ERROR [QuorumPeer:/0:0:0:0:0:0:0:0:11245:Leader@133] - Couldn't bind to port 11245
      java.net.BindException: Address already in use
      	at java.net.PlainSocketImpl.socketBind(Native Method)
      	at java.net.PlainSocketImpl.bind(PlainSocketImpl.java:365)
      	at java.net.ServerSocket.bind(ServerSocket.java:319)
      	at java.net.ServerSocket.<init>(ServerSocket.java:185)
      	at java.net.ServerSocket.<init>(ServerSocket.java:97)
      	at org.apache.zookeeper.server.quorum.Leader.<init>(Leader.java:131)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.makeLeader(QuorumPeer.java:512)
      	at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:657)
      
      1. TEST-org.apache.zookeeper.test.CnxManagerTest.txt
        71 kB
        Patrick Hunt
      2. workerthreads_badtest.txt
        308 kB
        Patrick Hunt
      3. ZOOKEEPER-1006.patch
        9 kB
        Patrick Hunt
      4. ZOOKEEPER-1006.patch
        6 kB
        Patrick Hunt

        Activity

        Hide
        Vishal Kher added a comment -

        patch committed to trunk as a part of ZOOKEEPER-880.

        Show
        Vishal Kher added a comment - patch committed to trunk as a part of ZOOKEEPER-880 .
        Hide
        Mahadev konar added a comment -

        SHould this be marked resolved since ZOOKEEPER-880 is fixed?

        Show
        Mahadev konar added a comment - SHould this be marked resolved since ZOOKEEPER-880 is fixed?
        Hide
        Vishal Kher added a comment -

        Patch submitted to ZOOKEEPER-880

        Show
        Vishal Kher added a comment - Patch submitted to ZOOKEEPER-880
        Hide
        Vishal Kher added a comment -

        Fix to be committed to trunk as a part of https://issues.apache.org/jira/browse/ZOOKEEPER-880

        Show
        Vishal Kher added a comment - Fix to be committed to trunk as a part of https://issues.apache.org/jira/browse/ZOOKEEPER-880
        Hide
        Vishal Kher added a comment -

        Hi Patrick,

        Thanks for the fix. I will port the test to the trunk.

        Show
        Vishal Kher added a comment - Hi Patrick, Thanks for the fix. I will port the test to the trunk.
        Hide
        Patrick Hunt added a comment -
        Show
        Patrick Hunt added a comment - I believe Vishal is working on addressing ZOOKEEPER-880 issue (the origin of this test) on trunk in a different jira see this https://issues.apache.org/jira/browse/ZOOKEEPER-880?focusedCommentId=12991286&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12991286
        Hide
        Benjamin Reed added a comment -

        i've committed this to 3.3, but it doesn't apply to trunk. any ideas why?

        Show
        Benjamin Reed added a comment - i've committed this to 3.3, but it doesn't apply to trunk. any ideas why?
        Hide
        Benjamin Reed added a comment -

        +1 looks good thanx!

        Show
        Benjamin Reed added a comment - +1 looks good thanx!
        Hide
        Patrick Hunt added a comment -

        Slating this for 3.3.4

        Show
        Patrick Hunt added a comment - Slating this for 3.3.4
        Hide
        Patrick Hunt added a comment -

        This issue occurred on Apache Hudson: https://hudson.apache.org/hudson/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch_3_3/179/

        I'd like to get this committed to 3.3 branch (if we ever do 3.3.4...) to elim any false failures on hudson, plz review.

        Show
        Patrick Hunt added a comment - This issue occurred on Apache Hudson: https://hudson.apache.org/hudson/view/S-Z/view/ZooKeeper/job/ZooKeeper_branch_3_3/179/ I'd like to get this committed to 3.3 branch (if we ever do 3.3.4...) to elim any false failures on hudson, plz review.
        Hide
        Patrick Hunt added a comment -

        Vishal – would you mind porting this to trunk? I believe you have a separate jira for trunk that incorporates this change, correct?

        Show
        Patrick Hunt added a comment - Vishal – would you mind porting this to trunk? I believe you have a separate jira for trunk that incorporates this change, correct?
        Hide
        Patrick Hunt added a comment -

        This updated patch uses polling rather than timeout (sleep) with a max limit on the poll time that's large.

        I retested this on my hosts and it passes consistently.

        Show
        Patrick Hunt added a comment - This updated patch uses polling rather than timeout (sleep) with a max limit on the poll time that's large. I retested this on my hosts and it passes consistently.
        Hide
        Patrick Hunt added a comment -

        Since I wrote the test, I can do that

        No worries Vishal. I have the h/w where it's failing so makes sense for me to address it (working on it currently). Thanks for the offer though. Keep up the good work (love seeing ppl add tests!). ttyl

        Show
        Patrick Hunt added a comment - Since I wrote the test, I can do that No worries Vishal. I have the h/w where it's failing so makes sense for me to address it (working on it currently). Thanks for the offer though. Keep up the good work (love seeing ppl add tests!). ttyl
        Hide
        Vishal Kher added a comment -

        We need to change this test to remove the sleep. Sleeps in tests are very bad, 1) they tend to fail on slow machines (and more importantly) 2) they make "ant test" take a long time.
        We don't want tests to take a long time. I did a bunch of work a year or two ago to remove all* sleeps from tests, we shouldn't let them creep back in (I realize it's hard to write tests w/o sleep, but it's critical to ensure the tests are fast and testers can rely on the results).

        I agree that we should avoid sleeps, however, on slower machines it is very difficult to give a preditable outcome. The test can fail even if we wait longer. So can the tester really rely on the result on slower machines?

        In general, on a resonably well configured setup, I consider the failure that we saw here as a legitimate failure. The test expects a node to join a running ensemble in initiLimt() * tickTime() * 2. The test is not failing the ensmble (and causing leader election in all 3 nodeS) while restarting the peer. If a peer cannot join in two attempts, then I would think that something is wrong in FLE. So that was my original intention for the timeout.

        btw, an easy fix for this test would be to sleep(250) in a loop around the thread count check. have some max loop count (equiv to say 60seconds of total time) to limit the failure case. In the "success" case the test will complete as soon as the machine can process the test.

        Sure, we can do that. On faster setups, this will let the test to finish sooner. Since I wrote the test, I can do that (if you haven't fixed it already).

        Show
        Vishal Kher added a comment - We need to change this test to remove the sleep. Sleeps in tests are very bad, 1) they tend to fail on slow machines (and more importantly) 2) they make "ant test" take a long time. We don't want tests to take a long time. I did a bunch of work a year or two ago to remove all* sleeps from tests, we shouldn't let them creep back in (I realize it's hard to write tests w/o sleep, but it's critical to ensure the tests are fast and testers can rely on the results). I agree that we should avoid sleeps, however, on slower machines it is very difficult to give a preditable outcome. The test can fail even if we wait longer. So can the tester really rely on the result on slower machines? In general, on a resonably well configured setup, I consider the failure that we saw here as a legitimate failure. The test expects a node to join a running ensemble in initiLimt() * tickTime() * 2. The test is not failing the ensmble (and causing leader election in all 3 nodeS) while restarting the peer. If a peer cannot join in two attempts, then I would think that something is wrong in FLE. So that was my original intention for the timeout. btw, an easy fix for this test would be to sleep(250) in a loop around the thread count check. have some max loop count (equiv to say 60seconds of total time) to limit the failure case. In the "success" case the test will complete as soon as the machine can process the test. Sure, we can do that. On faster setups, this will let the test to finish sooner. Since I wrote the test, I can do that (if you haven't fixed it already).
        Hide
        Patrick Hunt added a comment -

        btw, an easy fix for this test would be to sleep(250) in a loop around the thread count check. have some max loop count (equiv to say 60seconds of total time) to limit the failure case. In the "success" case the test will complete as soon as the machine can process the test.

        Show
        Patrick Hunt added a comment - btw, an easy fix for this test would be to sleep(250) in a loop around the thread count check. have some max loop count (equiv to say 60seconds of total time) to limit the failure case. In the "success" case the test will complete as soon as the machine can process the test.
        Hide
        Patrick Hunt added a comment -

        I just upped the sleep time to *5 (restart peer in testWorkerThread) original and now it passes on both hosts where it was failing previously. (I hate sleeps in tests btw!)

        We need to change this test to remove the sleep. Sleeps in tests are very bad, 1) they tend to fail on slow machines (and more importantly) 2) they make "ant test" take a long time. We don't want tests to take a long time. I did a bunch of work a year or two ago to remove all* sleeps from tests, we shouldn't let them creep back in (I realize it's hard to write tests w/o sleep, but it's critical to ensure the tests are fast and testers can rely on the results).

        Show
        Patrick Hunt added a comment - I just upped the sleep time to *5 (restart peer in testWorkerThread) original and now it passes on both hosts where it was failing previously. (I hate sleeps in tests btw!) We need to change this test to remove the sleep. Sleeps in tests are very bad, 1) they tend to fail on slow machines (and more importantly) 2) they make "ant test" take a long time. We don't want tests to take a long time. I did a bunch of work a year or two ago to remove all* sleeps from tests, we shouldn't let them creep back in (I realize it's hard to write tests w/o sleep, but it's critical to ensure the tests are fast and testers can rely on the results).
        Hide
        Patrick Hunt added a comment -

        here's an example of recent failure on ubuntu:

        2011-03-03 16:25:53,599 - INFO  [main:CnxManagerTest@309] - Round 0, restarting peer 1
        2011-03-03 16:25:53,600 - INFO  [main:FileSnap@82] - Reading snapshot /home/phunt/zookeeper/build/test/tmp/test4354134956302701428.junit.dir/version-2/snapshot.0
        2011-03-03 16:25:53,601 - INFO  [Thread-96:QuorumCnxManager$Listener@473] - My election bind port: 11255
        2011-03-03 16:25:53,602 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:QuorumPeer@621] - LOOKING
        2011-03-03 16:25:53,603 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@663] - New election. My id =  1, Proposed zxid = 0
        2011-03-03 16:25:53,637 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state)
        2011-03-03 16:25:53,637 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
        2011-03-03 16:25:53,638 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@721] - Updating proposal
        2011-03-03 16:25:53,638 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 0 (n.sid), LOOKING (my state)
        2011-03-03 16:25:53,638 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
        2011-03-03 16:25:53,654 - INFO  [WorkerSender Thread:QuorumCnxManager@183] - Have smaller server identifier, so dropping the connection: (2, 1)
        2011-03-03 16:25:53,654 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state)
        2011-03-03 16:25:53,654 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
        2011-03-03 16:25:53,655 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 0 (n.sid), LOOKING (my state)
        2011-03-03 16:25:53,657 - INFO  [WorkerSender Thread:QuorumCnxManager@183] - Have smaller server identifier, so dropping the connection: (2, 1)
        2011-03-03 16:25:53,657 - WARN  [Thread-102:QuorumCnxManager$RecvWorker@727] - Connection broken for id 1, my id = 2, error = java.nio.channels.AsynchronousCloseException
        2011-03-03 16:25:53,658 - WARN  [Thread-102:QuorumCnxManager$RecvWorker@730] - Interrupting SendWorker
        2011-03-03 16:25:53,687 - WARN  [Thread-103:QuorumCnxManager$SendWorker@638] - Exception when using channel: for id 2 my id = 1 error = java.io.IOException: Broken pipe
        2011-03-03 16:25:53,687 - WARN  [Thread-104:QuorumCnxManager$RecvWorker@727] - Connection broken for id 2, my id = 1, error = java.nio.channels.ClosedChannelException
        2011-03-03 16:25:53,687 - WARN  [Thread-104:QuorumCnxManager$RecvWorker@730] - Interrupting SendWorker
        2011-03-03 16:25:53,688 - ERROR [Thread-101:QuorumCnxManager$SendWorker@611] - Failed to send last message. Shutting down thread.
        java.nio.channels.ClosedChannelException
                at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126)
                at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
                at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.send(QuorumCnxManager.java:599)
                at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:609)
        2011-03-03 16:25:53,688 - WARN  [Thread-101:QuorumCnxManager$SendWorker@642] - Send worker leaving thread
        2011-03-03 16:25:53,688 - WARN  [Thread-103:QuorumCnxManager$SendWorker@642] - Send worker leaving thread
        2011-03-03 16:25:53,855 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 400
        2011-03-03 16:25:53,855 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
        2011-03-03 16:25:53,856 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state)
        2011-03-03 16:25:53,856 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 0 (n.sid), LOOKING (my state)
        2011-03-03 16:25:54,256 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 800
        2011-03-03 16:25:55,057 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 1600
        2011-03-03 16:25:56,657 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 3200
        ------------- ---------------- ---------------
        FAILURE
        
        Show
        Patrick Hunt added a comment - here's an example of recent failure on ubuntu: 2011-03-03 16:25:53,599 - INFO [main:CnxManagerTest@309] - Round 0, restarting peer 1 2011-03-03 16:25:53,600 - INFO [main:FileSnap@82] - Reading snapshot /home/phunt/zookeeper/build/test/tmp/test4354134956302701428.junit.dir/version-2/snapshot.0 2011-03-03 16:25:53,601 - INFO [Thread-96:QuorumCnxManager$Listener@473] - My election bind port: 11255 2011-03-03 16:25:53,602 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:QuorumPeer@621] - LOOKING 2011-03-03 16:25:53,603 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@663] - New election. My id = 1, Proposed zxid = 0 2011-03-03 16:25:53,637 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state) 2011-03-03 16:25:53,637 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state) 2011-03-03 16:25:53,638 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@721] - Updating proposal 2011-03-03 16:25:53,638 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 0 (n.sid), LOOKING (my state) 2011-03-03 16:25:53,638 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state) 2011-03-03 16:25:53,654 - INFO [WorkerSender Thread:QuorumCnxManager@183] - Have smaller server identifier, so dropping the connection: (2, 1) 2011-03-03 16:25:53,654 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state) 2011-03-03 16:25:53,654 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state) 2011-03-03 16:25:53,655 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 0 (n.sid), LOOKING (my state) 2011-03-03 16:25:53,657 - INFO [WorkerSender Thread:QuorumCnxManager@183] - Have smaller server identifier, so dropping the connection: (2, 1) 2011-03-03 16:25:53,657 - WARN [Thread-102:QuorumCnxManager$RecvWorker@727] - Connection broken for id 1, my id = 2, error = java.nio.channels.AsynchronousCloseException 2011-03-03 16:25:53,658 - WARN [Thread-102:QuorumCnxManager$RecvWorker@730] - Interrupting SendWorker 2011-03-03 16:25:53,687 - WARN [Thread-103:QuorumCnxManager$SendWorker@638] - Exception when using channel: for id 2 my id = 1 error = java.io.IOException: Broken pipe 2011-03-03 16:25:53,687 - WARN [Thread-104:QuorumCnxManager$RecvWorker@727] - Connection broken for id 2, my id = 1, error = java.nio.channels.ClosedChannelException 2011-03-03 16:25:53,687 - WARN [Thread-104:QuorumCnxManager$RecvWorker@730] - Interrupting SendWorker 2011-03-03 16:25:53,688 - ERROR [Thread-101:QuorumCnxManager$SendWorker@611] - Failed to send last message. Shutting down thread. java.nio.channels.ClosedChannelException at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:126) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.send(QuorumCnxManager.java:599) at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:609) 2011-03-03 16:25:53,688 - WARN [Thread-101:QuorumCnxManager$SendWorker@642] - Send worker leaving thread 2011-03-03 16:25:53,688 - WARN [Thread-103:QuorumCnxManager$SendWorker@642] - Send worker leaving thread 2011-03-03 16:25:53,855 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 400 2011-03-03 16:25:53,855 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state) 2011-03-03 16:25:53,856 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 1 (n.sid), FOLLOWING (my state) 2011-03-03 16:25:53,856 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 1 (n.round), FOLLOWING (n.state), 0 (n.sid), LOOKING (my state) 2011-03-03 16:25:54,256 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 800 2011-03-03 16:25:55,057 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 1600 2011-03-03 16:25:56,657 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11254:FastLeaderElection@697] - Notification time out: 3200 ------------- ---------------- --------------- FAILURE
        Hide
        Patrick Hunt added a comment -

        This patch fixes the address in use issue, but still fails. I just tried this test on ubuntu and it fails there now too (was passing previously)

        apply the patch then:

        ant -Dtestcase=CnxManagerTest clean test-core-java

        Show
        Patrick Hunt added a comment - This patch fixes the address in use issue, but still fails. I just tried this test on ubuntu and it fails there now too (was passing previously) apply the patch then: ant -Dtestcase=CnxManagerTest clean test-core-java
        Hide
        Patrick Hunt added a comment -

        Attached a log of the test failure after fixing the port conflict problem.

        Show
        Patrick Hunt added a comment - Attached a log of the test failure after fixing the port conflict problem.
        Hide
        Patrick Hunt added a comment -

        Yea I noticed that right away and fixed the port assignment issue, however the testWorkerThread test still fails:

        Notice that there are no logs for 4 seconds:

        2011-03-03 15:11:43,275 - INFO  [main:CnxManagerTest@309] - Round 2, restarting peer 0
        2011-03-03 15:11:43,276 - INFO  [main:FileSnap@82] - Reading snapshot /tmp/phunt-cdh3/zookeeper/build/test/tmp/test579996210206510713.junit.dir/version-2/snapshot.100000000
        2011-03-03 15:11:43,277 - INFO  [Thread-83:QuorumCnxManager$Listener@473] - My election bind port: 11252
        2011-03-03 15:11:43,286 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11251:QuorumPeer@621] - LOOKING
        2011-03-03 15:11:43,286 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:11251:FastLeaderElection@663] - New election. My id =  0, Proposed zxid = 4294967296
        2011-03-03 15:11:43,286 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state)
        ------------- ---------------- ---------------
        
        Testcase: testCnxManager took 0.315 sec
        Testcase: testCnxManagerTimeout took 5.014 sec
        Testcase: testCnxManagerSpinLock took 2.053 sec
        Testcase: testWorkerThreads took 34.183 sec
                FAILED
        Thu Mar 03 15:11:47 PST 2011 Incorrect number of Worker threads for sid=0 expected 4 found 0
        junit.framework.AssertionFailedError: Thu Mar 03 15:11:47 PST 2011 Incorrect number of Worker threads for sid=0 expected 4 found 0
                at org.apache.zookeeper.test.CnxManagerTest.verifyThreadCount(CnxManagerTest.java:324)
                at org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads(CnxManagerTest.java:313)
        
        Show
        Patrick Hunt added a comment - Yea I noticed that right away and fixed the port assignment issue, however the testWorkerThread test still fails: Notice that there are no logs for 4 seconds: 2011-03-03 15:11:43,275 - INFO [main:CnxManagerTest@309] - Round 2, restarting peer 0 2011-03-03 15:11:43,276 - INFO [main:FileSnap@82] - Reading snapshot /tmp/phunt-cdh3/zookeeper/build/test/tmp/test579996210206510713.junit.dir/version-2/snapshot.100000000 2011-03-03 15:11:43,277 - INFO [Thread-83:QuorumCnxManager$Listener@473] - My election bind port: 11252 2011-03-03 15:11:43,286 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11251:QuorumPeer@621] - LOOKING 2011-03-03 15:11:43,286 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:11251:FastLeaderElection@663] - New election. My id = 0, Proposed zxid = 4294967296 2011-03-03 15:11:43,286 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 0 (n.leader), 4294967296 (n.zxid), 1 (n.round), LOOKING (n.state), 0 (n.sid), LOOKING (my state) ------------- ---------------- --------------- Testcase: testCnxManager took 0.315 sec Testcase: testCnxManagerTimeout took 5.014 sec Testcase: testCnxManagerSpinLock took 2.053 sec Testcase: testWorkerThreads took 34.183 sec FAILED Thu Mar 03 15:11:47 PST 2011 Incorrect number of Worker threads for sid=0 expected 4 found 0 junit.framework.AssertionFailedError: Thu Mar 03 15:11:47 PST 2011 Incorrect number of Worker threads for sid=0 expected 4 found 0 at org.apache.zookeeper.test.CnxManagerTest.verifyThreadCount(CnxManagerTest.java:324) at org.apache.zookeeper.test.CnxManagerTest.testWorkerThreads(CnxManagerTest.java:313)
        Hide
        Flavio Junqueira added a comment -

        If I remember correctly, the fact that the port value is reused doesn't matter for the test cases in CnxManagerTest pre-880, since they don't start a zookeeper server. Apparently it does matter for a test case introduced in ZOOKEEPER-880 (testWorkerThreads).

        Show
        Flavio Junqueira added a comment - If I remember correctly, the fact that the port value is reused doesn't matter for the test cases in CnxManagerTest pre-880, since they don't start a zookeeper server. Apparently it does matter for a test case introduced in ZOOKEEPER-880 (testWorkerThreads).
        Hide
        Benjamin Reed added a comment -

        that exception is a red herring. it turns out we always get that exception. there is a bug in the test case. in setup() we have:

                    int clientport = PortAssignment.unique();
                    peers.put(Long.valueOf(i),
                            new QuorumServer(i,
                                    new InetSocketAddress(clientport),
                            new InetSocketAddress(PortAssignment.unique())));
                    tmpdir[i] = ClientBase.createTmpDir();
                    port[i] = clientport;
        

        so the clientport is getting assigned to the same port as the quorum port. we really need to construct QuorumServer with PortAssignment.unique() instead.

        Pat since you seem to be able to reproduce it, can you try that fix and see if it changes anything?

        Show
        Benjamin Reed added a comment - that exception is a red herring. it turns out we always get that exception. there is a bug in the test case. in setup() we have: int clientport = PortAssignment.unique(); peers.put( Long .valueOf(i), new QuorumServer(i, new InetSocketAddress(clientport), new InetSocketAddress(PortAssignment.unique()))); tmpdir[i] = ClientBase.createTmpDir(); port[i] = clientport; so the clientport is getting assigned to the same port as the quorum port. we really need to construct QuorumServer with PortAssignment.unique() instead. Pat since you seem to be able to reproduce it, can you try that fix and see if it changes anything?
        Hide
        Vishal Kher added a comment -

        Didn't realize it was the test for worker threads. I have not seen that error during this test. I will take a look.

        Show
        Vishal Kher added a comment - Didn't realize it was the test for worker threads. I have not seen that error during this test. I will take a look.
        Hide
        Patrick Hunt added a comment -

        I ran netstat and checked to see if there were any zombies - none found.

        This test just failed again...

        Show
        Patrick Hunt added a comment - I ran netstat and checked to see if there were any zombies - none found. This test just failed again...
        Hide
        Patrick Hunt added a comment -

        Timing? JVM? This happens to be a hudson machine, the slot is virtualized (vmware).

        more detail:
        Linux sles11-slave02 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux

        dual:
        model name : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz

        Server environment:java.version=1.6.0_21

        Show
        Patrick Hunt added a comment - Timing? JVM? This happens to be a hudson machine, the slot is virtualized (vmware). more detail: Linux sles11-slave02 2.6.27.19-5-default #1 SMP 2009-02-28 04:40:21 +0100 x86_64 x86_64 x86_64 GNU/Linux dual: model name : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz Server environment:java.version=1.6.0_21
        Hide
        Vishal Kher added a comment -

        May not be a regression (ZOOKEEPER-973).

        Show
        Vishal Kher added a comment - May not be a regression ( ZOOKEEPER-973 ).
        Hide
        Benjamin Reed added a comment -

        or my machine for that matter

        Show
        Benjamin Reed added a comment - or my machine for that matter
        Hide
        Benjamin Reed added a comment -

        why doesn't this show up in the nightly test builds?

        Show
        Benjamin Reed added a comment - why doesn't this show up in the nightly test builds?
        Hide
        Patrick Hunt added a comment -

        attached output of bad test

        Show
        Patrick Hunt added a comment - attached output of bad test

          People

          • Assignee:
            Patrick Hunt
            Reporter:
            Patrick Hunt
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development