Kafka
  1. Kafka
  2. KAFKA-1501

transient unit tests failures due to port already in use

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core
    • Labels:

      Description

      Saw the following transient failures.

      kafka.api.ProducerFailureHandlingTest > testTooLargeRecordWithAckOne FAILED
      kafka.common.KafkaException: Socket server failed to bind to localhost:59909: Address already in use.
      at kafka.network.Acceptor.openServerSocket(SocketServer.scala:195)
      at kafka.network.Acceptor.<init>(SocketServer.scala:141)
      at kafka.network.SocketServer.startup(SocketServer.scala:68)
      at kafka.server.KafkaServer.startup(KafkaServer.scala:95)
      at kafka.utils.TestUtils$.createServer(TestUtils.scala:123)
      at kafka.api.ProducerFailureHandlingTest.setUp(ProducerFailureHandlingTest.scala:68)

      1. KAFKA-1501_2015-03-09_11:41:07.patch
        135 kB
        Ewen Cheslack-Postava
      2. KAFKA-1501_2015-03-25_00:44:50.patch
        166 kB
        Ewen Cheslack-Postava
      3. KAFKA-1501.patch
        133 kB
        Ewen Cheslack-Postava
      4. KAFKA-1501.patch
        74 kB
        Ewen Cheslack-Postava
      5. KAFKA-1501.patch
        3 kB
        Guozhang Wang
      6. KAFKA-1501.patch
        0.9 kB
        Guozhang Wang
      7. KAFKA-1501-choosePorts.patch
        73 kB
        Ewen Cheslack-Postava
      8. test-100.out
        20 kB
        Guozhang Wang
      9. test-100.out
        20 kB
        Guozhang Wang
      10. test-27.out
        21 kB
        Guozhang Wang
      11. test-29.out
        22 kB
        Guozhang Wang
      12. test-32.out
        21 kB
        Guozhang Wang
      13. test-35.out
        22 kB
        Guozhang Wang
      14. test-38.out
        21 kB
        Guozhang Wang
      15. test-4.out
        20 kB
        Guozhang Wang
      16. test-42.out
        21 kB
        Guozhang Wang
      17. test-45.out
        21 kB
        Guozhang Wang
      18. test-46.out
        23 kB
        Guozhang Wang
      19. test-51.out
        21 kB
        Guozhang Wang
      20. test-55.out
        22 kB
        Guozhang Wang
      21. test-58.out
        20 kB
        Guozhang Wang
      22. test-59.out
        21 kB
        Guozhang Wang
      23. test-60.out
        21 kB
        Guozhang Wang
      24. test-69.out
        20 kB
        Guozhang Wang
      25. test-72.out
        20 kB
        Guozhang Wang
      26. test-74.out
        21 kB
        Guozhang Wang
      27. test-76.out
        20 kB
        Guozhang Wang
      28. test-84.out
        22 kB
        Guozhang Wang
      29. test-87.out
        21 kB
        Guozhang Wang
      30. test-91.out
        20 kB
        Guozhang Wang
      31. test-92.out
        21 kB
        Guozhang Wang

        Issue Links

          Activity

          Jun Rao created issue -
          Hide
          Jun Rao added a comment -

          In some of the unit tests, we obtain an open port once and reuse it for multiple tests. Between tests, we shutdown both the clients and the brokers. However, it seems that a port may not be freed up immediately. Perhaps we should just get a new port for each test.

          Show
          Jun Rao added a comment - In some of the unit tests, we obtain an open port once and reuse it for multiple tests. Between tests, we shutdown both the clients and the brokers. However, it seems that a port may not be freed up immediately. Perhaps we should just get a new port for each test.
          Jun Rao made changes -
          Field Original Value New Value
          Labels newbie
          Hide
          Abhishek Sharma added a comment -

          Please assign this to me. I will try to make the change for regrading this.

          Show
          Abhishek Sharma added a comment - Please assign this to me. I will try to make the change for regrading this.
          Hide
          Abhishek Sharma added a comment -

          Jun Rao Can you please suggest how to reproduce this test failure.

          Show
          Abhishek Sharma added a comment - Jun Rao Can you please suggest how to reproduce this test failure.
          Hide
          Chris Cope added a comment -

          Ugh, this bug was obnoxious! This has bit us a enough times that we had to fix it. To isolate, I ran the full set of tests on our test farm 100x for trunk and 100x for 0.8.1.

          • trunk failed 11/100 times
          • 0.8.1 failed 12/100 times
            It's a race condition. The fix is for ZooKeeperTestHarness but I need to rebase and retest it. Also, I think the failure rate may be related to the underlying hardware (faster processing = more likely to hit the race condition). I should have a fix that has been tested with the latest trunk tonight.
          Show
          Chris Cope added a comment - Ugh, this bug was obnoxious! This has bit us a enough times that we had to fix it. To isolate, I ran the full set of tests on our test farm 100x for trunk and 100x for 0.8.1. trunk failed 11/100 times 0.8.1 failed 12/100 times It's a race condition. The fix is for ZooKeeperTestHarness but I need to rebase and retest it. Also, I think the failure rate may be related to the underlying hardware (faster processing = more likely to hit the race condition). I should have a fix that has been tested with the latest trunk tonight.
          Hide
          Jay Kreps added a comment -

          That's awesome, thanks for digging in on this.

          Show
          Jay Kreps added a comment - That's awesome, thanks for digging in on this.
          Hide
          Abhishek Sharma added a comment -

          Chris Cope - Are you running './gradlew core:test' or './gradlew testAll' or test suite under system_test ?
          I am asking because when I ran './gradlew core:test' I was not able to found any such issue.

          Show
          Abhishek Sharma added a comment - Chris Cope - Are you running './gradlew core:test' or './gradlew testAll' or test suite under system_test ? I am asking because when I ran './gradlew core:test' I was not able to found any such issue.
          Hide
          Chris Cope added a comment -

          Abhishek Sharma,

          git clone https://github.com/apache/kafka.git .
          ./gradlew test
          

          There will be failures anywhere from 10%-20% of the time. I think there is a race condition with TestUtils.choosePorts(), where a port is grabbed, closed, and then when the KafkaTestHarness uses it, it's not available yet. Looking through failures of the last few hundred test runs I've done, there is usually one 1 (occasionally 2) ports at fault that then cause subsequent tests to fail for the test class.

          Essentially, this race condition is occurring approximately 0.06% of the time a socket server is created. However, my team frequently has to rerun tests after a code change, because sometimes it fails. This is seen at least multiple times a day by us. The best solution seems to catch this exception and then grab new ports. Again we're talking about the test harness, so which ports it runs on doesn't matter. Thoughts?

          Show
          Chris Cope added a comment - Abhishek Sharma , git clone https: //github.com/apache/kafka.git . ./gradlew test There will be failures anywhere from 10%-20% of the time. I think there is a race condition with TestUtils.choosePorts(), where a port is grabbed, closed, and then when the KafkaTestHarness uses it, it's not available yet. Looking through failures of the last few hundred test runs I've done, there is usually one 1 (occasionally 2) ports at fault that then cause subsequent tests to fail for the test class. Essentially, this race condition is occurring approximately 0.06% of the time a socket server is created. However, my team frequently has to rerun tests after a code change, because sometimes it fails. This is seen at least multiple times a day by us. The best solution seems to catch this exception and then grab new ports. Again we're talking about the test harness, so which ports it runs on doesn't matter. Thoughts?
          Hide
          Abhishek Sharma added a comment -

          Chris Cope

          I did intentionally tried to produce this issue by manually giving the same port value to different server when I was not able to reproduce it.

          I was thinking solution over the same line that you have suggested - catch the exception and then grab new port from TestUtils port related function for retry. One thing we need to take care is - to limit the number of tries otherwise in worst scenario it may end up in loop.

          Show
          Abhishek Sharma added a comment - Chris Cope I did intentionally tried to produce this issue by manually giving the same port value to different server when I was not able to reproduce it. I was thinking solution over the same line that you have suggested - catch the exception and then grab new port from TestUtils port related function for retry. One thing we need to take care is - to limit the number of tries otherwise in worst scenario it may end up in loop.
          Hide
          Chris Cope added a comment - - edited

          I agree, Abhishek Singh. I'm running some more tests and I think the best way to handle this unlikely event is to catch it specifically, and then have it rerun the entire test class one time, and noting this in the test log. This bug does not affect the core Kafka code, and is simply exposed here because Kafka has such great unit tests, and we just happen to run them A LOT for our purposes. I'm proposing this solution instead of hunting and fixing the underlying issue in choosePorts(), which when looking around at other projects does seem like a decent implementation.

          The probability of a test class failing twice in a row should be very low (<0.0001%) and should result in any test class failure less than 1% of the time `./gradlew test` is run.

          Is this approach sound?

          Show
          Chris Cope added a comment - - edited I agree, Abhishek Singh . I'm running some more tests and I think the best way to handle this unlikely event is to catch it specifically, and then have it rerun the entire test class one time, and noting this in the test log. This bug does not affect the core Kafka code, and is simply exposed here because Kafka has such great unit tests, and we just happen to run them A LOT for our purposes. I'm proposing this solution instead of hunting and fixing the underlying issue in choosePorts(), which when looking around at other projects does seem like a decent implementation. The probability of a test class failing twice in a row should be very low (<0.0001%) and should result in any test class failure less than 1% of the time `./gradlew test` is run. Is this approach sound?
          Hide
          Abhishek Sharma added a comment -

          Running an entire test class again is not a good approach (I think so). I think catching specific exception and retrying again in catch block with alternative port seems to be more efficient. This needs a change only in test class and I think that should be OK.

          Other approach is like we can we have another util function in TestUtils class for getting connection with same approach.

          What you think about taking an input from some other on this?

          Show
          Abhishek Sharma added a comment - Running an entire test class again is not a good approach (I think so). I think catching specific exception and retrying again in catch block with alternative port seems to be more efficient. This needs a change only in test class and I think that should be OK. Other approach is like we can we have another util function in TestUtils class for getting connection with same approach. What you think about taking an input from some other on this?
          Hide
          Neha Narkhede added a comment -

          I agree that re-running the test suite is not ideal. Since the probability of this failure is fairly low, catching the exception and retrying choosePorts a couple times may suffice.

          Show
          Neha Narkhede added a comment - I agree that re-running the test suite is not ideal. Since the probability of this failure is fairly low, catching the exception and retrying choosePorts a couple times may suffice.
          Hide
          Guozhang Wang added a comment -

          I can re-produce the issue from time to time on my PC, and after digging around for a while I think this is the cause:

          http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html

          I am attaching a simple patch here, Chris Cope could you help double-checking if it fixes your issue?

          Show
          Guozhang Wang added a comment - I can re-produce the issue from time to time on my PC, and after digging around for a while I think this is the cause: http://www.serverframework.com/asynchronousevents/2011/01/time-wait-and-its-design-implications-for-protocols-and-scalable-servers.html I am attaching a simple patch here, Chris Cope could you help double-checking if it fixes your issue?
          Guozhang Wang made changes -
          Attachment KAFKA-1501.patch [ 12676664 ]
          Hide
          Guozhang Wang added a comment -

          Created reviewboard https://reviews.apache.org/r/27101/diff/
          against branch origin/trunk

          Show
          Guozhang Wang added a comment - Created reviewboard https://reviews.apache.org/r/27101/diff/ against branch origin/trunk
          Guozhang Wang made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Assignee Guozhang Wang [ guozhang ]
          Hide
          Jay Kreps added a comment -

          Nice!

          Show
          Jay Kreps added a comment - Nice!
          Hide
          Chris Cope added a comment -

          Awesome! I should know shortly, spinning off 100 jobs via phone...

          Show
          Chris Cope added a comment - Awesome! I should know shortly, spinning off 100 jobs via phone...
          Hide
          Chris Cope added a comment -

          Unfortunately same issues, 9/100 tests had a bunch of failures

          Show
          Chris Cope added a comment - Unfortunately same issues, 9/100 tests had a bunch of failures
          Hide
          Guozhang Wang added a comment -

          Thanks Chris, I also got the same conclusion on my tests. Need some more time tackling the problem.

          Show
          Guozhang Wang added a comment - Thanks Chris, I also got the same conclusion on my tests. Need some more time tackling the problem.
          Hide
          Jay Kreps added a comment -

          Oh, man, my heart is broken, I was sure that was going to work.

          It looks like a few projects do a few extra steps to check availability including creating a datagram socket using the given port. For example:
          http://stackoverflow.com/questions/434718/sockets-discover-port-availability-using-java

          The comments mention something in Apache Mina which is essentially the same.

          I have no idea why that is needed, but maybe worth trying that in choosePorts? I like the idea of improving choosePorts rather than changing SocketServer...

          Show
          Jay Kreps added a comment - Oh, man, my heart is broken, I was sure that was going to work. It looks like a few projects do a few extra steps to check availability including creating a datagram socket using the given port. For example: http://stackoverflow.com/questions/434718/sockets-discover-port-availability-using-java The comments mention something in Apache Mina which is essentially the same. I have no idea why that is needed, but maybe worth trying that in choosePorts? I like the idea of improving choosePorts rather than changing SocketServer...
          Guozhang Wang made changes -
          Attachment KAFKA-1501.patch [ 12677209 ]
          Hide
          Guozhang Wang added a comment -

          Created reviewboard https://reviews.apache.org/r/27214/diff/
          against branch origin/trunk

          Show
          Guozhang Wang added a comment - Created reviewboard https://reviews.apache.org/r/27214/diff/ against branch origin/trunk
          Hide
          Guozhang Wang added a comment -

          It turns that both the choosePort and socketServer needs to set SO_RESUEADDR to true. Have run 10 unit tests with v2 patch and no exceptions found (on my old PC that was a percentage of 30%+).

          Show
          Guozhang Wang added a comment - It turns that both the choosePort and socketServer needs to set SO_RESUEADDR to true. Have run 10 unit tests with v2 patch and no exceptions found (on my old PC that was a percentage of 30%+).
          Hide
          Jay Kreps added a comment -

          Nice, so statistically it is 93% likely to be fixed, then!

          So since this changes the socket server default is this the right thing to do? Could this have any negative side effects in production? I actually don't really understand the effect of this option or why lack of it was causing the failure. Could you explain?

          Show
          Jay Kreps added a comment - Nice, so statistically it is 93% likely to be fixed, then! So since this changes the socket server default is this the right thing to do? Could this have any negative side effects in production? I actually don't really understand the effect of this option or why lack of it was causing the failure. Could you explain?
          Hide
          Chris Cope added a comment -

          14/200 test runs failed with the "java.net.BindException: java.net.BindException: Address already in use" errors. This bug is rough.

          Show
          Chris Cope added a comment - 14/200 test runs failed with the "java.net.BindException: java.net.BindException: Address already in use" errors. This bug is rough.
          Hide
          Jay Kreps added a comment -

          It might be worth checking out that stack overflow article. It has several heuristics that it recommends.

          Show
          Jay Kreps added a comment - It might be worth checking out that stack overflow article. It has several heuristics that it recommends.
          Hide
          Guozhang Wang added a comment -

          Ouch... Could you share the stack trace on some of the failed tests? Are they all from SocketServer?

          From the JavaDocs and http://stackoverflow.com/questions/434718/sockets-discover-port-availability-using-java it seems setting SO_RESUEADDR has no negative effect, all it does is that if a port is bound to a socket with SO_RESUEADDR enabled and is in the WAIT_TIMEOUT state, another socket with SO_RESUEADDR enabled can bind to this port (i.s. reusing it).

          Show
          Guozhang Wang added a comment - Ouch... Could you share the stack trace on some of the failed tests? Are they all from SocketServer? From the JavaDocs and http://stackoverflow.com/questions/434718/sockets-discover-port-availability-using-java it seems setting SO_RESUEADDR has no negative effect, all it does is that if a port is bound to a socket with SO_RESUEADDR enabled and is in the WAIT_TIMEOUT state, another socket with SO_RESUEADDR enabled can bind to this port (i.s. reusing it).
          Hide
          Ewen Cheslack-Postava added a comment -

          Did anyone actually verify that a port is getting into TIME_WAIT or was that just a hunch? It actually seems unlikely since the socket was returned by choosePorts and there's no threading that would allow the socket to still be bound. And if it was a socket previously used for accept(), the only way it should end up in TIME_WAIT is if there was an outstanding connection request that hadn't been handled when the socket was closed.

          I think a much simpler explanation is that a port is being allocated twice within each test. I suspect you're seeing these errors on ZooKeeperTestHarness tests because it uses a single port that is allocated in the TestZKUtils object – that port is used for all tests. This means that there are plenty of times when that port is not bound (before a test has started) and choosePort() or choosePorts() is called (during test class instantiation), which could then return that same port and cause a conflict. Unfortunately, I am not able to reproduce this issue so I can't verify that. If someone else wants to try to verify, just logging the values returned by choosePort and the value of TestZKUtils.zookeeperConnect would make this issue easy to track down in a log.

          What we really need is to make sure that tests use a single call to choosePorts() to allocate all the ports they'll need. The attached patch should do this. It's obviously possible to call choosePorts() twice, but I've tried to discourage it. The choosePort() variant is removed and a warning is added to the choosePorts() documentation. It uses a new base class, NetworkTestHarness, for all tests that need to coordinate multiple ports (i.e., anything that uses ZookeeperTestHarness since at that point both ZookeeperTestHarness and the test class will probably need to call choosePorts()). Because of the way KafkaServerTestHarness works, I made them all get allocated at initialization (so configs for KafkaServerTestHarness can still be generated at test class instantiation). You have to know how many to allocate up front, but by default it allocates 5 so that all the current tests don't need to override anything.

          Chris Cope - can you test out this patch since you can reliably reproduce the issue? And can you give an idea of the type of hardware you're able to reproduce it on since you mentioned it seems common on beefier hardware?

          Show
          Ewen Cheslack-Postava added a comment - Did anyone actually verify that a port is getting into TIME_WAIT or was that just a hunch? It actually seems unlikely since the socket was returned by choosePorts and there's no threading that would allow the socket to still be bound. And if it was a socket previously used for accept(), the only way it should end up in TIME_WAIT is if there was an outstanding connection request that hadn't been handled when the socket was closed. I think a much simpler explanation is that a port is being allocated twice within each test. I suspect you're seeing these errors on ZooKeeperTestHarness tests because it uses a single port that is allocated in the TestZKUtils object – that port is used for all tests. This means that there are plenty of times when that port is not bound (before a test has started) and choosePort() or choosePorts() is called (during test class instantiation), which could then return that same port and cause a conflict. Unfortunately, I am not able to reproduce this issue so I can't verify that. If someone else wants to try to verify, just logging the values returned by choosePort and the value of TestZKUtils.zookeeperConnect would make this issue easy to track down in a log. What we really need is to make sure that tests use a single call to choosePorts() to allocate all the ports they'll need. The attached patch should do this. It's obviously possible to call choosePorts() twice, but I've tried to discourage it. The choosePort() variant is removed and a warning is added to the choosePorts() documentation. It uses a new base class, NetworkTestHarness, for all tests that need to coordinate multiple ports (i.e., anything that uses ZookeeperTestHarness since at that point both ZookeeperTestHarness and the test class will probably need to call choosePorts()). Because of the way KafkaServerTestHarness works, I made them all get allocated at initialization (so configs for KafkaServerTestHarness can still be generated at test class instantiation). You have to know how many to allocate up front, but by default it allocates 5 so that all the current tests don't need to override anything. Chris Cope - can you test out this patch since you can reliably reproduce the issue? And can you give an idea of the type of hardware you're able to reproduce it on since you mentioned it seems common on beefier hardware?
          Ewen Cheslack-Postava made changes -
          Attachment KAFKA-1501-choosePorts.patch [ 12678006 ]
          Hide
          Jay Kreps added a comment -

          Interesting. There is an inherent race condition in between choosePorts picking an open port and then the test actually using it. We can't perfectly solve this issue (i.e. since our apache tests run on a shared unit test machine even if we are perfect it is possible we could get broken by other tests in another process). However choosePorts is particularly prone to the problem Ewan described because it will always return the same port until someone actually uses it. Another approach would be to chose the port randomly in some range, check that you can bind to it, and then release it and give it out. This would make this kind of accident less likely as you would have to both hit the race condition and collide on the same port.

          Show
          Jay Kreps added a comment - Interesting. There is an inherent race condition in between choosePorts picking an open port and then the test actually using it. We can't perfectly solve this issue (i.e. since our apache tests run on a shared unit test machine even if we are perfect it is possible we could get broken by other tests in another process). However choosePorts is particularly prone to the problem Ewan described because it will always return the same port until someone actually uses it. Another approach would be to chose the port randomly in some range, check that you can bind to it, and then release it and give it out. This would make this kind of accident less likely as you would have to both hit the race condition and collide on the same port.
          Hide
          Guozhang Wang added a comment -

          OK. I finally got to re-produce the error after running about 60 times, all of which coming from the same place:

          java.net.BindException: Address already in use
          	at sun.nio.ch.Net.bind(Native Method)
          	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:124)
          	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
          	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52)
          	at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95)
          	at kafka.zk.EmbeddedZookeeper.<init>(EmbeddedZookeeper.scala:33)
          	at kafka.zk.ZooKeeperTestHarness$class.setUp(ZooKeeperTestHarness.scala:33)
          	at kafka.admin.AdminTest.setUp(AdminTest.scala:33)
          	at junit.framework.TestCase.runBare(TestCase.java:128)
          	at junit.framework.TestResult$1.protect(TestResult.java:110)
          	at junit.framework.TestResult.runProtected(TestResult.java:128)
          	at junit.framework.TestResult.run(TestResult.java:113)
          	at junit.framework.TestCase.run(TestCase.java:120)
          	at junit.framework.TestSuite.runTest(TestSuite.java:228)
          	at junit.framework.TestSuite.run(TestSuite.java:223)
          	at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:35)
          	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:86)
          	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:49)
          	at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:69)
          	at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:48)
          	at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
          	at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
          	at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
          	at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
          	at $Proxy2.processTestClass(Unknown Source)
          	at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:105)
          	at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
          	at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
          	at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355)
          	at org.gradle.internal.concurrent.DefaultExecutorFactory$StoppableExecutorImpl$1.run(DefaultExecutorFactory.java:64)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
          	at java.lang.Thread.run(Thread.java:662)
          

          Another observation is that a bulk of the binding errors (I got penty this time) happen consecutively, i.e. when the first binding error happens it continue to fail on the following 60+ test cases until resolves itself. So Ewen's theory is likely to be true: when such a race condition happens, it will not be resolved until it was used and released by one socket before the other tries to bind to it.

          Show
          Guozhang Wang added a comment - OK. I finally got to re-produce the error after running about 60 times, all of which coming from the same place: java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:124) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:52) at org.apache.zookeeper.server.NIOServerCnxnFactory.configure(NIOServerCnxnFactory.java:95) at kafka.zk.EmbeddedZookeeper.<init>(EmbeddedZookeeper.scala:33) at kafka.zk.ZooKeeperTestHarness$class.setUp(ZooKeeperTestHarness.scala:33) at kafka.admin.AdminTest.setUp(AdminTest.scala:33) at junit.framework.TestCase.runBare(TestCase.java:128) at junit.framework.TestResult$1.protect(TestResult.java:110) at junit.framework.TestResult.runProtected(TestResult.java:128) at junit.framework.TestResult.run(TestResult.java:113) at junit.framework.TestCase.run(TestCase.java:120) at junit.framework.TestSuite.runTest(TestSuite.java:228) at junit.framework.TestSuite.run(TestSuite.java:223) at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:35) at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:86) at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:49) at org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:69) at org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:48) at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32) at org.gradle.messaging.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93) at $Proxy2.processTestClass(Unknown Source) at org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:105) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35) at org.gradle.messaging.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24) at org.gradle.messaging.remote.internal.hub.MessageHub$Handler.run(MessageHub.java:355) at org.gradle.internal.concurrent.DefaultExecutorFactory$StoppableExecutorImpl$1.run(DefaultExecutorFactory.java:64) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang. Thread .run( Thread .java:662) Another observation is that a bulk of the binding errors (I got penty this time) happen consecutively, i.e. when the first binding error happens it continue to fail on the following 60+ test cases until resolves itself. So Ewen's theory is likely to be true: when such a race condition happens, it will not be resolved until it was used and released by one socket before the other tries to bind to it.
          Hide
          Guozhang Wang added a comment -

          Ewen Cheslack-Postava I try to test you patch but it does not apply to current trunk. Could you rebase?

          Show
          Guozhang Wang added a comment - Ewen Cheslack-Postava I try to test you patch but it does not apply to current trunk. Could you rebase?
          Ewen Cheslack-Postava made changes -
          Attachment KAFKA-1501.patch [ 12685059 ]
          Hide
          Ewen Cheslack-Postava added a comment -

          Created reviewboard https://reviews.apache.org/r/28694/diff/
          against branch origin/trunk

          Show
          Ewen Cheslack-Postava added a comment - Created reviewboard https://reviews.apache.org/r/28694/diff/ against branch origin/trunk
          Hide
          Ewen Cheslack-Postava added a comment -

          Guozhang Wang Rebased and passes tests, but as before I haven't verified it resolves the issue since I hadn't been able to reproduce the original issue.

          Show
          Ewen Cheslack-Postava added a comment - Guozhang Wang Rebased and passes tests, but as before I haven't verified it resolves the issue since I hadn't been able to reproduce the original issue.
          Hide
          Guozhang Wang added a comment -

          Ewen,

          Thanks for the patch. I tested it and there is still some "Address already in use" errors shown with 100 runs, but with much less occurrences it seems. Could you take a look and see if they were missed in your patch?

          I have uploaded the results with those failed cases.

          Show
          Guozhang Wang added a comment - Ewen, Thanks for the patch. I tested it and there is still some "Address already in use" errors shown with 100 runs, but with much less occurrences it seems. Could you take a look and see if they were missed in your patch? I have uploaded the results with those failed cases.
          Guozhang Wang made changes -
          Attachment test-100.out [ 12685384 ]
          Attachment test-27.out [ 12685385 ]
          Attachment test-100.out [ 12685386 ]
          Attachment test-92.out [ 12685387 ]
          Attachment test-91.out [ 12685388 ]
          Attachment test-87.out [ 12685389 ]
          Attachment test-84.out [ 12685390 ]
          Attachment test-76.out [ 12685391 ]
          Attachment test-74.out [ 12685392 ]
          Attachment test-72.out [ 12685393 ]
          Attachment test-69.out [ 12685394 ]
          Attachment test-60.out [ 12685395 ]
          Attachment test-59.out [ 12685396 ]
          Attachment test-58.out [ 12685397 ]
          Attachment test-55.out [ 12685398 ]
          Attachment test-51.out [ 12685399 ]
          Attachment test-46.out [ 12685400 ]
          Attachment test-45.out [ 12685401 ]
          Attachment test-42.out [ 12685402 ]
          Attachment test-38.out [ 12685403 ]
          Attachment test-35.out [ 12685404 ]
          Attachment test-32.out [ 12685405 ]
          Attachment test-29.out [ 12685406 ]
          Attachment test-4.out [ 12685407 ]
          Hide
          Ewen Cheslack-Postava added a comment -

          They've all been patched. I do notice that it's always failing on after a few tests within a suite, there's no particular test triggering it, and it happens for both ZK and Kafka nodes. The patch now allocates all the ports needed when the test class is instantiated and reuses them across individual tests. So now it looks more like a port is getting left in a state that doesn't allow it to be freed, i.e. something closer to the problem people originally thought it was. Unfortunately allocating ports separately for each test isn't easy because of the way the tests are currently structured (building configs during construction so setUp() can pull up the servers).

          Show
          Ewen Cheslack-Postava added a comment - They've all been patched. I do notice that it's always failing on after a few tests within a suite, there's no particular test triggering it, and it happens for both ZK and Kafka nodes. The patch now allocates all the ports needed when the test class is instantiated and reuses them across individual tests. So now it looks more like a port is getting left in a state that doesn't allow it to be freed, i.e. something closer to the problem people originally thought it was. Unfortunately allocating ports separately for each test isn't easy because of the way the tests are currently structured (building configs during construction so setUp() can pull up the servers).
          Hide
          Jay Kreps added a comment - - edited

          Yeah that Stack Overflow article I linked to previously indicates that as of Java 7 the only really reliable way to check is to try to create a socket and see if that works. That should be a super non-invasive change too--just updating the implementation for choosePorts.

          http://stackoverflow.com/questions/434718/sockets-discover-port-availability-using-java

          They give a checkAvailable method something like the below. So the new approach would be to choose a random port in some range, and then check that it is available.

          private static boolean checkAvailable(int port) {
              Socket s = null;
              try {
                  s = new Socket("localhost", port);
                  return false;
              } catch (IOException e) {
                  return true;
              } finally {
                  if( s != null){
                      try {
                          s.close();
                      } catch (IOException e) {
                          throw new RuntimeException("You should handle this error." , e);
                      }
                  }
              }
          }
          
          Show
          Jay Kreps added a comment - - edited Yeah that Stack Overflow article I linked to previously indicates that as of Java 7 the only really reliable way to check is to try to create a socket and see if that works. That should be a super non-invasive change too--just updating the implementation for choosePorts. http://stackoverflow.com/questions/434718/sockets-discover-port-availability-using-java They give a checkAvailable method something like the below. So the new approach would be to choose a random port in some range, and then check that it is available. private static boolean checkAvailable( int port) { Socket s = null ; try { s = new Socket( "localhost" , port); return false ; } catch (IOException e) { return true ; } finally { if ( s != null ){ try { s.close(); } catch (IOException e) { throw new RuntimeException( "You should handle this error." , e); } } } }
          Hide
          Guozhang Wang added a comment -

          In current choosePort we are already doing the same pattern, i.e. choose a random port and tries to open a socket for it, and close the socket before returning the ports.

          val sockets =
                for(i <- 0 until count)
                yield {
                  val socket = new ServerSocket()
                  ....
                  socket
                }
              val socketList = sockets.toList
              val ports = socketList.map(_.getLocalPort)
              socketList.map(_.close)
          

          I have incorporated both Ewen and my patch together but there are still a few in-use exceptions in a 100-run test.

          Show
          Guozhang Wang added a comment - In current choosePort we are already doing the same pattern, i.e. choose a random port and tries to open a socket for it, and close the socket before returning the ports. val sockets = for (i <- 0 until count) yield { val socket = new ServerSocket() .... socket } val socketList = sockets.toList val ports = socketList.map(_.getLocalPort) socketList.map(_.close) I have incorporated both Ewen and my patch together but there are still a few in-use exceptions in a 100-run test.
          Gwen Shapira made changes -
          Link This issue is duplicated by KAFKA-1916 [ KAFKA-1916 ]
          Hide
          Gwen Shapira added a comment -

          Is there anything pending here? Any reason not to commit this patch and reduce the test failures?

          Guozhang Wang Jay Kreps Jun Rao

          Show
          Gwen Shapira added a comment - Is there anything pending here? Any reason not to commit this patch and reduce the test failures? Guozhang Wang Jay Kreps Jun Rao
          Hide
          Guozhang Wang added a comment -

          Gwen, the problem is that, we seems still not been able to fix the problem with my and Ewen's patch together (sigh..) If you have some other ideas / proposals I would very much love to discuss as this issue is biting myself nearly every day.

          Show
          Guozhang Wang added a comment - Gwen, the problem is that, we seems still not been able to fix the problem with my and Ewen's patch together (sigh..) If you have some other ideas / proposals I would very much love to discuss as this issue is biting myself nearly every day.
          Ewen Cheslack-Postava made changes -
          Attachment KAFKA-1501.patch [ 12703098 ]
          Hide
          Ewen Cheslack-Postava added a comment -

          Created reviewboard https://reviews.apache.org/r/31806/diff/
          against branch origin/trunk

          Show
          Ewen Cheslack-Postava added a comment - Created reviewboard https://reviews.apache.org/r/31806/diff/ against branch origin/trunk
          Hide
          Ewen Cheslack-Postava added a comment -

          Uploaded a wip patch. It gets rid of choosePorts entirely and makes the tests work using random ports instead for both ZK and Kafka. A couple of notes:

          1. One change this necessitated is that a bunch of things that used to just be initialized during test class construction now have to be dynamic since you can't generate the Kafka configs until you know the ZK port. This has two impacts. First, KafkaServerTestHarness subclasses now have to override a generateConfigs() class rather than just overriding the configs field. Second, the minimal patch to make this work maintains the ability to access some data (info about zk, the list of configs) like fields (no ()), but I think this might just be misleading or confusing to people writing tests – something like getConfigs() might make it clearer that it will only be valid while a test is running.
          2. A few tests were specifying ports directly instead of using choosePorts. I think I found them all, but it'd be good to have a couple more eyes looking for them.
          3. Tests that bounce brokers became more difficult because the port changes when you restart. In most cases you this isn't a problem, you just need to make sure you instantiate producers/consumers at the right time. However, one test (ProducerFailureHandlingTest.testBrokerFailure) revealed an underlying issue. There are conditions where you can bounce the brokers too quickly and because of the way the new producer gets metadata, it can get stuck with old metadata and none of the brokers are listening on the ports it has. I included a patch which in theory should address the problem, but the producer is also having an issue where sometimes connection requests take a long time to finish, and during that time the brokers all bounce, leaving the producer with no useful addresses in its copy of the metadata. In practice you would never bounce your servers to new addresses that quickly, so this is purely an artifact of having to use random ports during tests. If anyone has suggestions for how to handle this, I'm all ears. In order to allow testing the rest of the patch, I commented out that test for the time being.

          I wanted to get this up so we can discuss these issues, but also so Guozhang Wang can test this to verify the approach will work before I spend much more time on it. I tested a few times with 5 copies of the tests running concurrently.

          Show
          Ewen Cheslack-Postava added a comment - Uploaded a wip patch. It gets rid of choosePorts entirely and makes the tests work using random ports instead for both ZK and Kafka. A couple of notes: 1. One change this necessitated is that a bunch of things that used to just be initialized during test class construction now have to be dynamic since you can't generate the Kafka configs until you know the ZK port. This has two impacts. First, KafkaServerTestHarness subclasses now have to override a generateConfigs() class rather than just overriding the configs field. Second, the minimal patch to make this work maintains the ability to access some data (info about zk, the list of configs) like fields (no ()), but I think this might just be misleading or confusing to people writing tests – something like getConfigs() might make it clearer that it will only be valid while a test is running. 2. A few tests were specifying ports directly instead of using choosePorts. I think I found them all, but it'd be good to have a couple more eyes looking for them. 3. Tests that bounce brokers became more difficult because the port changes when you restart. In most cases you this isn't a problem, you just need to make sure you instantiate producers/consumers at the right time. However, one test (ProducerFailureHandlingTest.testBrokerFailure) revealed an underlying issue. There are conditions where you can bounce the brokers too quickly and because of the way the new producer gets metadata, it can get stuck with old metadata and none of the brokers are listening on the ports it has. I included a patch which in theory should address the problem, but the producer is also having an issue where sometimes connection requests take a long time to finish, and during that time the brokers all bounce, leaving the producer with no useful addresses in its copy of the metadata. In practice you would never bounce your servers to new addresses that quickly, so this is purely an artifact of having to use random ports during tests. If anyone has suggestions for how to handle this, I'm all ears. In order to allow testing the rest of the patch, I commented out that test for the time being. I wanted to get this up so we can discuss these issues, but also so Guozhang Wang can test this to verify the approach will work before I spend much more time on it. I tested a few times with 5 copies of the tests running concurrently.
          Ewen Cheslack-Postava made changes -
          Attachment KAFKA-1501_2015-03-09_11:41:07.patch [ 12703463 ]
          Hide
          Ewen Cheslack-Postava added a comment -

          Updated reviewboard https://reviews.apache.org/r/31806/diff/
          against branch origin/trunk

          Show
          Ewen Cheslack-Postava added a comment - Updated reviewboard https://reviews.apache.org/r/31806/diff/ against branch origin/trunk
          Hide
          Guozhang Wang added a comment -

          Thanks for the patch Ewen Cheslack-Postava. I have started to look at your patch but want to get some clarifications:

          TestUtils.choosePorts should return a random port number each time it gets called, but then since the socket is closed, the next time it gets called, the same port number could be returned before it gets used, hence causing the conflict?

          Show
          Guozhang Wang added a comment - Thanks for the patch Ewen Cheslack-Postava . I have started to look at your patch but want to get some clarifications: TestUtils.choosePorts should return a random port number each time it gets called, but then since the socket is closed, the next time it gets called, the same port number could be returned before it gets used, hence causing the conflict?
          Hide
          Ewen Cheslack-Postava added a comment -

          Guozhang Wang Yes, that's correct, so the only way to completely avoid the problem is to allow the kernel to assign the port automatically. I haven't checked, but I also wouldn't be surprised if the kernel actually saves recently freed ports to use for a fast path, which could explain why this happens more frequently than you might think it would given the fairly large range used for ephemeral ports.

          Show
          Ewen Cheslack-Postava added a comment - Guozhang Wang Yes, that's correct, so the only way to completely avoid the problem is to allow the kernel to assign the port automatically. I haven't checked, but I also wouldn't be surprised if the kernel actually saves recently freed ports to use for a fast path, which could explain why this happens more frequently than you might think it would given the fairly large range used for ephemeral ports.
          Ewen Cheslack-Postava made changes -
          Attachment KAFKA-1501_2015-03-25_00:44:50.patch [ 12707164 ]
          Hide
          Ewen Cheslack-Postava added a comment -

          Updated reviewboard https://reviews.apache.org/r/31806/diff/
          against branch origin/trunk

          Show
          Ewen Cheslack-Postava added a comment - Updated reviewboard https://reviews.apache.org/r/31806/diff/ against branch origin/trunk
          Hide
          Ewen Cheslack-Postava added a comment -

          Guozhang Wang Sorry, seems I missed your review comments on the last patch. I've addressed most of them and rebased to currrent trunk. Also had to make some changes to handle the consumer tests that were added back. And I restored the test I commented to use the previous port allocation strategy so we still get test coverage while minimizing (but not entirely eliminating) tests that can fail due to ports already in use.

          We now have 2 places where we need fixed ports to properly handle tests – bounce tests for the new producer and new consumer implementations. Both are integration tests that probably better to have in a separate integration tests framework. But until we have a better solution than the existing system test framework, it probably does make sense to fix the majority of tests. A few can still fail, but at least we'll have drastically reduced the probability of encountering this error.

          Show
          Ewen Cheslack-Postava added a comment - Guozhang Wang Sorry, seems I missed your review comments on the last patch. I've addressed most of them and rebased to currrent trunk. Also had to make some changes to handle the consumer tests that were added back. And I restored the test I commented to use the previous port allocation strategy so we still get test coverage while minimizing (but not entirely eliminating) tests that can fail due to ports already in use. We now have 2 places where we need fixed ports to properly handle tests – bounce tests for the new producer and new consumer implementations. Both are integration tests that probably better to have in a separate integration tests framework. But until we have a better solution than the existing system test framework, it probably does make sense to fix the majority of tests. A few can still fail, but at least we'll have drastically reduced the probability of encountering this error.
          Hide
          Guozhang Wang added a comment -

          Ran 100 times with port issue not showing at all, compared to about 20% rate on the same machine I think we can call it "solved", thanks Ewen Cheslack-Postava!!

          Found some other transient unit tests with this 100 runs, filed KAFKA-2054 separately to track them.

          Show
          Guozhang Wang added a comment - Ran 100 times with port issue not showing at all, compared to about 20% rate on the same machine I think we can call it "solved", thanks Ewen Cheslack-Postava !! Found some other transient unit tests with this 100 runs, filed KAFKA-2054 separately to track them.
          Guozhang Wang made changes -
          Assignee Guozhang Wang [ guozhang ] Ewen Cheslack-Postava [ ewencp ]
          Hide
          Jay Kreps added a comment -

          Fantastic!

          Show
          Jay Kreps added a comment - Fantastic!
          Hide
          Jay Kreps added a comment -

          Committed. Ewen Cheslack-Postava is my personal hero.

          Show
          Jay Kreps added a comment - Committed. Ewen Cheslack-Postava is my personal hero.
          Jay Kreps made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hide
          Gwen Shapira added a comment - - edited

          Not to take from this awesome improvement, but any reason ProducerFailureHandlingTest.testBrokerFailure() was removed?

          (I mean, other than the fact that it is consistently failing / hanging after this patch... auto-merge left the test in and I found out the hard way.
          I admit failure to finding out why producers hang on lack of memory after this patch is added, but I'm a bit concerned)

          Show
          Gwen Shapira added a comment - - edited Not to take from this awesome improvement, but any reason ProducerFailureHandlingTest.testBrokerFailure() was removed? (I mean, other than the fact that it is consistently failing / hanging after this patch... auto-merge left the test in and I found out the hard way. I admit failure to finding out why producers hang on lack of memory after this patch is added, but I'm a bit concerned)
          Hide
          Chris Cope added a comment -

          No failures aside from those tracked in KAFKA-2054. The port collisions is fixed. Ewen Cheslack-Postava gets free beer at the next conference. Bonus beer for reducing LOC

          Show
          Chris Cope added a comment - No failures aside from those tracked in KAFKA-2054 . The port collisions is fixed. Ewen Cheslack-Postava gets free beer at the next conference. Bonus beer for reducing LOC
          Hide
          Jay Kreps added a comment -

          Gwen Shapira I think the bounce tests in both the producer and consumer were moved, not deleted (I was confused by the same thing).

          Show
          Jay Kreps added a comment - Gwen Shapira I think the bounce tests in both the producer and consumer were moved, not deleted (I was confused by the same thing).
          Hide
          Gwen Shapira added a comment -

          woops. You are right Jay Kreps, I overlooked it.
          I'll join the line of people buying Ewen Cheslack-Postava a beer

          Show
          Gwen Shapira added a comment - woops. You are right Jay Kreps , I overlooked it. I'll join the line of people buying Ewen Cheslack-Postava a beer
          Transition Time In Source Status Execution Times Last Executer Last Execution Date
          Open Open Patch Available Patch Available
          125d 17h 53m 1 Guozhang Wang 23/Oct/14 19:34
          Patch Available Patch Available Resolved Resolved
          163d 3h 12m 1 Jay Kreps 04/Apr/15 22:46

            People

            • Assignee:
              Ewen Cheslack-Postava
              Reporter:
              Jun Rao
            • Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development