Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: 3.4.0
    • Component/s: None
    • Labels:
      None

      Description

      the hudson test failed on http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/2/testReport/. There are huge set of cancelledkeyexceptions in the logs. Still going through the logs to find out the reason for failure.

      1. nohup-QuorumPeerMainTest-201004301209.txt
        179 kB
        Alan Cabrera
      2. jstack-AsyncHammerTest-201004301209.txt
        10 kB
        Alan Cabrera
      3. nohup-AsyncHammerTest-201004301209.txt
        9.02 MB
        Alan Cabrera
      4. ZOOKEEPER-690.patch
        11 kB
        Henry Robinson
      5. nohup-201004291527.txt
        2.05 MB
        Alan Cabrera
      6. jstack-201004291527.txt
        12 kB
        Alan Cabrera
      7. ZOOKEEPER-690.patch
        10 kB
        Henry Robinson
      8. jstack-201004291409.txt
        46 kB
        Alan Cabrera
      9. nohup-201004291409.txt
        4.69 MB
        Alan Cabrera
      10. ZOOKEEPER-690.patch
        10 kB
        Henry Robinson
      11. nohup-201004201053.txt
        4.34 MB
        Alan Cabrera
      12. jstack-201004201053.txt
        46 kB
        Alan Cabrera
      13. zoo.log
        49 kB
        Alan Cabrera
      14. TEST-org.apache.zookeeper.test.AsyncHammerTest.txt
        2.23 MB
        Alan Cabrera

        Issue Links

          Activity

          Hide
          Mahadev konar added a comment -

          Its been too long for this jira. I cant reproduce it now.

          Show
          Mahadev konar added a comment - Its been too long for this jira. I cant reproduce it now.
          Hide
          Flavio Junqueira added a comment -

          Hi Alan, It does not block for me, with or without the patch. My configuration is pretty much like your "fine" one.

          Show
          Flavio Junqueira added a comment - Hi Alan, It does not block for me, with or without the patch. My configuration is pretty much like your "fine" one.
          Hide
          Alan Cabrera added a comment -

          I understand that it works for him. Just wondering if he needs the patch to get it to work.

          Fine:

          MacBook Pro - Intel Core 2 Duo 2.53 GHz, 4 GB
          Mac OS X 10.5.8 - all updates applied
          Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-9M3125)
          Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
          Using WiFi through Cisco VPN

          Not so good:

          MacPro - 2 Quad-Core Intel Xeon 2.26GHz, 24GB
          Mac OS X 10.5.8 - all updates applied
          Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-9M3125)
          Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
          One Ethernet port used

          If you're interested I might be able to clean this machine and put it in a DMZ for you all to tinker with.

          Show
          Alan Cabrera added a comment - I understand that it works for him. Just wondering if he needs the patch to get it to work. Fine: MacBook Pro - Intel Core 2 Duo 2.53 GHz, 4 GB Mac OS X 10.5.8 - all updates applied Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-9M3125) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode) Using WiFi through Cisco VPN Not so good: MacPro - 2 Quad-Core Intel Xeon 2.26GHz, 24GB Mac OS X 10.5.8 - all updates applied Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-9M3125) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode) One Ethernet port used If you're interested I might be able to clean this machine and put it in a DMZ for you all to tinker with.
          Hide
          Patrick Hunt added a comment -

          I think Flavio is saying it works for him.

          Alan, any idea what's different btw the versions that are running fine, and the one where it fails? Java VM version? CPU core count, networking setup, is there something that stands out?

          Show
          Patrick Hunt added a comment - I think Flavio is saying it works for him. Alan, any idea what's different btw the versions that are running fine, and the one where it fails? Java VM version? CPU core count, networking setup, is there something that stands out?
          Hide
          Flavio Junqueira added a comment -

          Removing the blocker flag and making it major for 3.4.0.

          Show
          Flavio Junqueira added a comment - Removing the blocker flag and making it major for 3.4.0.
          Hide
          Alan Cabrera added a comment -

          Are you saying that the AsyncHammerTest also fails on your machines w/ out the patch? I ask this because trunk, with no patches, builds and tests fine on all my other Macs except the one at work.

          My JAVA_HOME is properly set.

          Show
          Alan Cabrera added a comment - Are you saying that the AsyncHammerTest also fails on your machines w/ out the patch? I ask this because trunk , with no patches, builds and tests fine on all my other Macs except the one at work. My JAVA_HOME is properly set.
          Hide
          Flavio Junqueira added a comment -

          +1, the patch looks good.

          I believe I have the same setup as Alan: Mac OS X 10.5.8 and the same java version. It runs fine for me. I have run from Eclipse and from the command line multiple times, and I observed no problem. I have also asked a colleague with the same setup to try to reproduce, and also runs for my colleague.

          Alan, could you make sure that JAVA_HOME is set to the correct path before you run ant from the command line? If the variable is not set, then ant might not pick the correct java version to run, even if it is marked in the java preferences utility.

          We should keep investigating this problem, but given that it runs in two computers with similar configurations, I recommend we remove the blocker flag for this issue.

          Show
          Flavio Junqueira added a comment - +1, the patch looks good. I believe I have the same setup as Alan: Mac OS X 10.5.8 and the same java version. It runs fine for me. I have run from Eclipse and from the command line multiple times, and I observed no problem. I have also asked a colleague with the same setup to try to reproduce, and also runs for my colleague. Alan, could you make sure that JAVA_HOME is set to the correct path before you run ant from the command line? If the variable is not set, then ant might not pick the correct java version to run, even if it is marked in the java preferences utility. We should keep investigating this problem, but given that it runs in two computers with similar configurations, I recommend we remove the blocker flag for this issue.
          Hide
          Alan Cabrera added a comment -

          Correct. The child seem to have completed. The parent was the one that seemed to have locked up.

          Show
          Alan Cabrera added a comment - Correct. The child seem to have completed. The parent was the one that seemed to have locked up.
          Hide
          Benjamin Reed added a comment -

          alan it looks like you took the jstack of the parent process rather than the child.

          Show
          Benjamin Reed added a comment - alan it looks like you took the jstack of the parent process rather than the child.
          Hide
          Henry Robinson added a comment -

          Weird - it looks like the test is shutting down correctly:

          [junit] 2010-04-30 11:41:52,896 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11233
          [junit] 2010-04-30 11:41:52,896 - INFO [main:QuorumBase@277] - 127.0.0.1:11233 is no longer accepting client connections
          [junit] 2010-04-30 11:41:52,896 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11234
          [junit] 2010-04-30 11:41:52,897 - INFO [main:QuorumBase@277] - 127.0.0.1:11234 is no longer accepting client connections
          [junit] 2010-04-30 11:41:52,897 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11235
          [junit] 2010-04-30 11:41:52,897 - INFO [main:QuorumBase@277] - 127.0.0.1:11235 is no longer accepting client connections
          [junit] 2010-04-30 11:41:52,897 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11236
          [junit] 2010-04-30 11:41:52,898 - INFO [main:QuorumBase@277] - 127.0.0.1:11236 is no longer accepting client connections
          [junit] 2010-04-30 11:41:52,898 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11237
          [junit] 2010-04-30 11:41:52,898 - INFO [main:QuorumBase@277] - 127.0.0.1:11237 is no longer accepting client connections
          [junit] 2010-04-30 11:41:52,901 - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@56] - FINISHED TEST METHOD testObserversHammer
          [junit] 2010-04-30 11:41:52,901 - INFO [main:ZKTestCase$1@59] - SUCCEEDED testObserversHammer
          [junit] 2010-04-30 11:41:52,901 - INFO [main:ZKTestCase$1@54] - FINISHED testObserversHammer

          and then it goes into trying the C tests which fail for an unrelated reason - does it lock up at this point or does it actually fail out to the CLI? If it locks up, is the jstack output you attached from that run?

          Show
          Henry Robinson added a comment - Weird - it looks like the test is shutting down correctly: [junit] 2010-04-30 11:41:52,896 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11233 [junit] 2010-04-30 11:41:52,896 - INFO [main:QuorumBase@277] - 127.0.0.1:11233 is no longer accepting client connections [junit] 2010-04-30 11:41:52,896 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11234 [junit] 2010-04-30 11:41:52,897 - INFO [main:QuorumBase@277] - 127.0.0.1:11234 is no longer accepting client connections [junit] 2010-04-30 11:41:52,897 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11235 [junit] 2010-04-30 11:41:52,897 - INFO [main:QuorumBase@277] - 127.0.0.1:11235 is no longer accepting client connections [junit] 2010-04-30 11:41:52,897 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11236 [junit] 2010-04-30 11:41:52,898 - INFO [main:QuorumBase@277] - 127.0.0.1:11236 is no longer accepting client connections [junit] 2010-04-30 11:41:52,898 - INFO [main:ClientBase@222] - connecting to 127.0.0.1 11237 [junit] 2010-04-30 11:41:52,898 - INFO [main:QuorumBase@277] - 127.0.0.1:11237 is no longer accepting client connections [junit] 2010-04-30 11:41:52,901 - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@56] - FINISHED TEST METHOD testObserversHammer [junit] 2010-04-30 11:41:52,901 - INFO [main:ZKTestCase$1@59] - SUCCEEDED testObserversHammer [junit] 2010-04-30 11:41:52,901 - INFO [main:ZKTestCase$1@54] - FINISHED testObserversHammer and then it goes into trying the C tests which fail for an unrelated reason - does it lock up at this point or does it actually fail out to the CLI? If it locks up, is the jstack output you attached from that run?
          Hide
          Alan Cabrera added a comment -

          The first run is the run where Ant locks up. The second run is the subsequent run after the first run zombies.

          Rebooting clears everything for the first test run to lock up and become zombied.

          Show
          Alan Cabrera added a comment - The first run is the run where Ant locks up. The second run is the subsequent run after the first run zombies. Rebooting clears everything for the first test run to lock up and become zombied.
          Hide
          Henry Robinson added a comment -

          Hi Alan -

          Looking at this attachment: nohup-AsyncHammerTest-201004301209.txt - the tests appear to be run twice. The first testObserversHammer completes successfully, the second fails. Were you running the tests until you experienced the failure?

          Henry

          Show
          Henry Robinson added a comment - Hi Alan - Looking at this attachment: nohup-AsyncHammerTest-201004301209.txt - the tests appear to be run twice. The first testObserversHammer completes successfully, the second fails. Were you running the tests until you experienced the failure? Henry
          Hide
          Alan Cabrera added a comment -

          The test still locks up but it seems that it's Ant that freezes. This process then becomes a zombie process. This zombie process holds on to ports that the tests use and so subsequent runs fail. Notably, QuorumPeerMainTest takes 15 minutes to fail.

          Show
          Alan Cabrera added a comment - The test still locks up but it seems that it's Ant that freezes. This process then becomes a zombie process. This zombie process holds on to ports that the tests use and so subsequent runs fail. Notably, QuorumPeerMainTest takes 15 minutes to fail.
          Hide
          Henry Robinson added a comment -

          Hi Alan - Do you think you might have a chance to test whether this patch improves things?

          Show
          Henry Robinson added a comment - Hi Alan - Do you think you might have a chance to test whether this patch improves things?
          Hide
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12443246/ZOOKEEPER-690.patch
          against trunk revision 939172.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 12 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/77/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/77/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/77/console

          This message is automatically generated.

          Show
          Hadoop QA added a comment - +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12443246/ZOOKEEPER-690.patch against trunk revision 939172. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 12 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/77/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/77/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Console output: http://hudson.zones.apache.org/hudson/job/Zookeeper-Patch-h1.grid.sp2.yahoo.net/77/console This message is automatically generated.
          Hide
          Henry Robinson added a comment -

          Ben -

          Agreed. I see this as the same as setMyid(...) - it sets an immutable value and should only be called once. I'd prefer if these parameters were 'final' in QuorumPeer and set in the constructor, but that's not the way that runFromConfig (the only place outside of tests that these methods are called) is written. Then we could get rid of setLearnerType, for sure.

          The real error here, I think, is duplicating the learnertype between QuorumPeer and QuorumServer. If we are going to have the list of QuorumServers, then getLearnerType should lookup the learner type in the peer map. Same for the serverid, perhaps, and we should just save a reference to the QuorumServer that represents our Quorumpeer.

          Show
          Henry Robinson added a comment - Ben - Agreed. I see this as the same as setMyid(...) - it sets an immutable value and should only be called once. I'd prefer if these parameters were 'final' in QuorumPeer and set in the constructor, but that's not the way that runFromConfig (the only place outside of tests that these methods are called) is written. Then we could get rid of setLearnerType, for sure. The real error here, I think, is duplicating the learnertype between QuorumPeer and QuorumServer. If we are going to have the list of QuorumServers, then getLearnerType should lookup the learner type in the peer map. Same for the serverid, perhaps, and we should just save a reference to the QuorumServer that represents our Quorumpeer.
          Hide
          Benjamin Reed added a comment -

          henry, i think this may show that we can't really have a setLearnerType() method. In the real distributed setting, each peer will have its own list, so we should really think of the peers list as immutable.

          Show
          Benjamin Reed added a comment - henry, i think this may show that we can't really have a setLearnerType() method. In the real distributed setting, each peer will have its own list, so we should really think of the peers list as immutable.
          Hide
          Henry Robinson added a comment -

          Fixing that bug unmasked another ugliness with the way that tests were started and stopped. This patch makes the process cleaner. Let's see if it helps, Alan?

          Show
          Henry Robinson added a comment - Fixing that bug unmasked another ugliness with the way that tests were started and stopped. This patch makes the process cleaner. Let's see if it helps, Alan?
          Hide
          Alan Cabrera added a comment -

          Still locks up. (BTW, I applied the latest patch to a clean checkout of zookeeper)

          Show
          Alan Cabrera added a comment - Still locks up. (BTW, I applied the latest patch to a clean checkout of zookeeper)
          Hide
          Henry Robinson added a comment -

          Alan - would you mind trying this new patch? Thanks for your patience. I suspect that something might still be a bit flaky with these tests (not the code, but the tests), but I hope this will fix this particular problem.

          Show
          Henry Robinson added a comment - Alan - would you mind trying this new patch? Thanks for your patience. I suspect that something might still be a bit flaky with these tests (not the code, but the tests), but I hope this will fix this particular problem.
          Hide
          Henry Robinson added a comment -

          This map is, I think, shared between the quorumpeers for the purposes of the test (and in general there aren't two quorumpeers sharing this datastructure when running normally).

          But! The error here is that I'm dumb (and that Java's type-checking leaves a little to be desired). I've written quorumPeers.containsValue up there, but actually it should be quorumPeers.containsKey. New patch on the way, let's see if that fixes it.

          Show
          Henry Robinson added a comment - This map is, I think, shared between the quorumpeers for the purposes of the test (and in general there aren't two quorumpeers sharing this datastructure when running normally). But! The error here is that I'm dumb (and that Java's type-checking leaves a little to be desired). I've written quorumPeers.containsValue up there, but actually it should be quorumPeers.containsKey. New patch on the way, let's see if that fixes it.
          Hide
          Benjamin Reed added a comment -

          i think the key fix is here:

          public void setLearnerType(LearnerType p) {
          learnerType = p;
          if (quorumPeers.containsValue(this.myid))

          Unknown macro: { this.quorumPeers.get(myid).type = p; }

          else

          Unknown macro: { LOG.error("Setting LearnerType to " + p + " but " + myid + " not in QuorumPeers. "); }

          ...
          }

          right?

          the problem i see is that we are only updating the quorumPeers for the one peer. the other peers are going to be thinking it is a participant.

          Show
          Benjamin Reed added a comment - i think the key fix is here: public void setLearnerType(LearnerType p) { learnerType = p; if (quorumPeers.containsValue(this.myid)) Unknown macro: { this.quorumPeers.get(myid).type = p; } else Unknown macro: { LOG.error("Setting LearnerType to " + p + " but " + myid + " not in QuorumPeers. "); } ... } right? the problem i see is that we are only updating the quorumPeers for the one peer. the other peers are going to be thinking it is a participant.
          Hide
          Alan Cabrera added a comment -

          Disregard the last comment.

          I have now cleaned and rebuilt and now it still locks up.

          Show
          Alan Cabrera added a comment - Disregard the last comment. I have now cleaned and rebuilt and now it still locks up.
          Hide
          Henry Robinson added a comment -

          Note that both testHammer and testObserversHammer fail in the most recent log.

          Show
          Henry Robinson added a comment - Note that both testHammer and testObserversHammer fail in the most recent log.
          Hide
          Alan Cabrera added a comment -

          Test does not lock up but it does not pass.

          Show
          Alan Cabrera added a comment - Test does not lock up but it does not pass.
          Hide
          Henry Robinson added a comment -

          Alan - can you try this patch to see if it fixes things?

          Thanks,

          Henry

          Show
          Henry Robinson added a comment - Alan - can you try this patch to see if it fixes things? Thanks, Henry
          Hide
          Henry Robinson added a comment -

          I have found what I hope is the problem.

          Because QuorumPeers duplicate their 'LearnerType' in two places there's the possibility that they may get out of sync. This is what was happening here - it was a test bug. Although the Observers knew that they were Observers, the other nodes did not. This affected the leader election protocol as other node did not know to reject an Observer.

          I feel like we should refactor the QuorumPeer.QuorumServer code so as not to duplicate information, but for the time being I think this patch will work.

          I have also taken the opportunity to standardise the naming of 'learnertype' throughout the code (in some places it was called 'peertype' adding to the confusion).

          Tests pass on my machine, but I can't guarantee that the problem is fixed as I could never recreate the error.

          Thanks to Flavio for catching the broken invariant!

          Show
          Henry Robinson added a comment - I have found what I hope is the problem. Because QuorumPeers duplicate their 'LearnerType' in two places there's the possibility that they may get out of sync. This is what was happening here - it was a test bug. Although the Observers knew that they were Observers, the other nodes did not. This affected the leader election protocol as other node did not know to reject an Observer. I feel like we should refactor the QuorumPeer.QuorumServer code so as not to duplicate information, but for the time being I think this patch will work. I have also taken the opportunity to standardise the naming of 'learnertype' throughout the code (in some places it was called 'peertype' adding to the confusion). Tests pass on my machine, but I can't guarantee that the problem is fixed as I could never recreate the error. Thanks to Flavio for catching the broken invariant!
          Hide
          Flavio Junqueira added a comment -

          FLE relies upon a correct implementation of the voting view. My understanding is that if an observer is being elected leader, then the following predicate is evaluating to false:

          (!self.getVotingView().containsKey(response.sid))
          

          This is line 201 of FastLeaderElection.java. If this predicate is true, meaning that an observer is not in the voting view of a server, then the server will send a response right away and won't consider the vote of the observer. Does it make sense?

          Show
          Flavio Junqueira added a comment - FLE relies upon a correct implementation of the voting view. My understanding is that if an observer is being elected leader, then the following predicate is evaluating to false: (!self.getVotingView().containsKey(response.sid)) This is line 201 of FastLeaderElection.java. If this predicate is true, meaning that an observer is not in the voting view of a server, then the server will send a response right away and won't consider the vote of the observer. Does it make sense?
          Hide
          Henry Robinson added a comment -

          Progress update - possibly to do with a bug in FLE allowing an Observer to be elected. We're looking into this now.

          Show
          Henry Robinson added a comment - Progress update - possibly to do with a bug in FLE allowing an Observer to be elected. We're looking into this now.
          Hide
          Patrick Hunt added a comment -

          Moving this as a blocker for 3.3.1 - I think we need to fix this as part of the release.

          Henry, any progress? It would be great to get 3.3.1 to a vote soon (to get out a fix for the stat issue)

          Show
          Patrick Hunt added a comment - Moving this as a blocker for 3.3.1 - I think we need to fix this as part of the release. Henry, any progress? It would be great to get 3.3.1 to a vote soon (to get out a fix for the stat issue)
          Hide
          Patrick Hunt added a comment -

          The Observer test is failing, and it looks like one of the observer servers might be the issue (11247 is the fifth server afaict), Henry any insight?

          [junit] 2010-04-20 10:48:21,936 - INFO [QuorumPeer:/0.0.0.0:11247:QuorumPeer@619] - OBSERVING

          [junit] 2010-04-20 10:49:42,593 - INFO [main:ClientBase@536] - Found match with array:[389, 389, 389, 389, 0]
          [junit] 2010-04-20 10:49:42,594 - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@53] - TEST METHOD FAILED testObserversHammer
          [junit] java.lang.AssertionError: node count not consistent expected:<389> but was:<0>

          Show
          Patrick Hunt added a comment - The Observer test is failing, and it looks like one of the observer servers might be the issue (11247 is the fifth server afaict), Henry any insight? [junit] 2010-04-20 10:48:21,936 - INFO [QuorumPeer:/0.0.0.0:11247:QuorumPeer@619] - OBSERVING [junit] 2010-04-20 10:49:42,593 - INFO [main:ClientBase@536] - Found match with array: [389, 389, 389, 389, 0] [junit] 2010-04-20 10:49:42,594 - INFO [main:JUnit4ZKTestRunner$LoggedInvokeMethod@53] - TEST METHOD FAILED testObserversHammer [junit] java.lang.AssertionError: node count not consistent expected:<389> but was:<0>
          Hide
          Alan Cabrera added a comment -

          One interesting thing I noticed is that after I kill the process it turns into a zombie process.

          Show
          Alan Cabrera added a comment - One interesting thing I noticed is that after I kill the process it turns into a zombie process.
          Hide
          Alan Cabrera added a comment -

          Stack dump and log.

          Show
          Alan Cabrera added a comment - Stack dump and log.
          Hide
          Patrick Hunt added a comment -

          Right, also > test.out 2>&1

          will help to capture the output.

          Show
          Patrick Hunt added a comment - Right, also > test.out 2>&1 will help to capture the output.
          Hide
          Mahadev konar added a comment -

          alan,
          junit doesnt persist the logs into the testfiles in case the test hangs. So, you will have to print out the logs on the console so that you know where it is hanging -

          try

          ant test -Dtestcase=AsyncHammerTest -Dtest.output=yes

          Show
          Mahadev konar added a comment - alan, junit doesnt persist the logs into the testfiles in case the test hangs. So, you will have to print out the logs on the console so that you know where it is hanging - try ant test -Dtestcase=AsyncHammerTest -Dtest.output=yes
          Hide
          Alan Cabrera added a comment -

          There's nothing in the log:

          [acabrera-md:zookeeper 507]$ cat build/test/logs/TEST-org.apache.zookeeper.test.AsyncHammerTest.txt
          Testsuite: org.apache.zookeeper.test.AsyncHammerTest
          
          Show
          Alan Cabrera added a comment - There's nothing in the log: [acabrera-md:zookeeper 507]$ cat build/test/logs/TEST-org.apache.zookeeper.test.AsyncHammerTest.txt Testsuite: org.apache.zookeeper.test.AsyncHammerTest
          Hide
          Patrick Hunt added a comment -

          Can you attach the log, that would be very useful for me to review prior to digging into the thread dump.

          Show
          Patrick Hunt added a comment - Can you attach the log, that would be very useful for me to review prior to digging into the thread dump.
          Hide
          Alan Cabrera added a comment -

          I reran the tests a minute or two later, without having done anything, and the test got hung. I attached the thread dump to this issue.

          Show
          Alan Cabrera added a comment - I reran the tests a minute or two later, without having done anything, and the test got hung. I attached the thread dump to this issue.
          Hide
          Patrick Hunt added a comment -

          The following snippets are from TEST-org.apache.zookeeper.test.AsyncHammerTest.txt

          Looks like there's a problem with the address already being in use. Maybe you can
          try editing src/java/test/org/apache/zookeeper/PortAssignment.java and see if that helps?

          2010-04-19 14:31:55,010 - INFO [main:QuorumBase@88] - Ports are: 127.0.0.1:11222,127.0.0.1:11223,127.0.0.1:11224,127.0.0.1:11225,127.0.0.1:11226
          2010-04-19 14:31:55,015 - INFO [main:QuorumBase@145] - creating QuorumPeer 1 port 11222
          2010-04-19 14:31:55,062 - INFO [main:NIOServerCnxn$Factory@144] - binding to port 0.0.0.0/0.0.0.0:11222
          2010-04-19 14:31:55,063 - INFO [main:AsyncHammerTest@68] - Test clients shutting down
          2010-04-19 14:31:55,064 - INFO [main:QuorumBase@254] - TearDown started

          Testcase: testHammer took 0.443 sec
          Caused an ERROR
          Address already in use
          java.net.BindException: Address already in use
          at sun.nio.ch.Net.bind(Native Method)
          at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
          at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
          at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52)
          at org.apache.zookeeper.server.NIOServerCnxn$Factory.<init>(NIOServerCnxn.java:145)
          at org.apache.zookeeper.server.NIOServerCnxn$Factory.<init>(NIOServerCnxn.java:126)
          at org.apache.zookeeper.server.quorum.QuorumPeer.<init>(QuorumPeer.java:450)
          at org.apache.zookeeper.test.QuorumBase.startServers(QuorumBase.java:146)
          at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:96)
          at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:60)
          at org.apache.zookeeper.test.AsyncHammerTest.setUp(AsyncHammerTest.java:54)

          Caused an ERROR

          Show
          Patrick Hunt added a comment - The following snippets are from TEST-org.apache.zookeeper.test.AsyncHammerTest.txt Looks like there's a problem with the address already being in use. Maybe you can try editing src/java/test/org/apache/zookeeper/PortAssignment.java and see if that helps? 2010-04-19 14:31:55,010 - INFO [main:QuorumBase@88] - Ports are: 127.0.0.1:11222,127.0.0.1:11223,127.0.0.1:11224,127.0.0.1:11225,127.0.0.1:11226 2010-04-19 14:31:55,015 - INFO [main:QuorumBase@145] - creating QuorumPeer 1 port 11222 2010-04-19 14:31:55,062 - INFO [main:NIOServerCnxn$Factory@144] - binding to port 0.0.0.0/0.0.0.0:11222 2010-04-19 14:31:55,063 - INFO [main:AsyncHammerTest@68] - Test clients shutting down 2010-04-19 14:31:55,064 - INFO [main:QuorumBase@254] - TearDown started Testcase: testHammer took 0.443 sec Caused an ERROR Address already in use java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52) at org.apache.zookeeper.server.NIOServerCnxn$Factory.<init>(NIOServerCnxn.java:145) at org.apache.zookeeper.server.NIOServerCnxn$Factory.<init>(NIOServerCnxn.java:126) at org.apache.zookeeper.server.quorum.QuorumPeer.<init>(QuorumPeer.java:450) at org.apache.zookeeper.test.QuorumBase.startServers(QuorumBase.java:146) at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:96) at org.apache.zookeeper.test.QuorumBase.setUp(QuorumBase.java:60) at org.apache.zookeeper.test.AsyncHammerTest.setUp(AsyncHammerTest.java:54) Caused an ERROR
          Hide
          Alan Cabrera added a comment -

          Ok, this run of the test has been running for about 10 minutes

          Show
          Alan Cabrera added a comment - Ok, this run of the test has been running for about 10 minutes
          Hide
          Alan Cabrera added a comment - - edited

          Odd. Didn't hang this time.

          junit.run:
              [junit] Running org.apache.zookeeper.test.AsyncHammerTest
              [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 32.958 sec
              [junit] Test org.apache.zookeeper.test.AsyncHammerTest FAILED
          
          BUILD FAILED
          /Users/acabrera/dev/hadoop/zookeeper/build.xml:818: Tests failed!
          
          Total time: 46 seconds
          [acabrera-md:zookeeper 620]$ java -version
          java version "1.6.0_17"
          Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-9M3125)
          Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode)
          

          Mac OS X 10.5.8

          Show
          Alan Cabrera added a comment - - edited Odd. Didn't hang this time. junit.run: [junit] Running org.apache.zookeeper.test.AsyncHammerTest [junit] Tests run: 2, Failures: 0, Errors: 2, Time elapsed: 32.958 sec [junit] Test org.apache.zookeeper.test.AsyncHammerTest FAILED BUILD FAILED /Users/acabrera/dev/hadoop/zookeeper/build.xml:818: Tests failed! Total time: 46 seconds [acabrera-md:zookeeper 620]$ java -version java version "1.6.0_17" Java(TM) SE Runtime Environment (build 1.6.0_17-b04-248-9M3125) Java HotSpot(TM) 64-Bit Server VM (build 14.3-b01-101, mixed mode) Mac OS X 10.5.8
          Hide
          Patrick Hunt added a comment -

          Details such as os and java version would be really helpful. If you could upload the log file from the failed test that would be super useful. (build/test/logs/.)

          This test in particular has found at least a couple issues.

          Btw, you know that you can run just this test via:

          ant -Dtestcase=AsyncHammerTest test-core-java

          which might save you some time.

          Show
          Patrick Hunt added a comment - Details such as os and java version would be really helpful. If you could upload the log file from the failed test that would be super useful. (build/test/logs/.) This test in particular has found at least a couple issues. Btw, you know that you can run just this test via: ant -Dtestcase=AsyncHammerTest test-core-java which might save you some time.
          Hide
          Henry Robinson added a comment -

          Alan - that would be great. If you can take a jstack dump of the process when it hangs we can do some forensics.

          Show
          Henry Robinson added a comment - Alan - that would be great. If you can take a jstack dump of the process when it hangs we can do some forensics.
          Hide
          Alan Cabrera added a comment -

          This consistently hangs on my machine. I'll try to get some time to see why this is so.

          Show
          Alan Cabrera added a comment - This consistently hangs on my machine. I'll try to get some time to see why this is so.
          Hide
          Patrick Hunt added a comment -

          not a blocker for 3.3.0, tracking in 3.3.1. Have not seen these since, ie not reproduceable but let's keep an eye open.

          Show
          Patrick Hunt added a comment - not a blocker for 3.3.0, tracking in 3.3.1. Have not seen these since, ie not reproduceable but let's keep an eye open.
          Hide
          Patrick Hunt added a comment -

          downgrading from blocker, haven't seen this again after many runs. we'll keep an eye but shouldn't block 3.3.0 (and may have been fixed)

          Show
          Patrick Hunt added a comment - downgrading from blocker, haven't seen this again after many runs. we'll keep an eye but shouldn't block 3.3.0 (and may have been fixed)

            People

            • Assignee:
              Henry Robinson
              Reporter:
              Mahadev konar
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development