Uploaded image for project: 'MINA'
  1. MINA
  2. DIRMINA-1041

WriteFuture.await() hangs when the connection is closed remotely before await is invoked

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.13
    • Fix Version/s: 2.0.16
    • Component/s: Core
    • Labels:
      None

      Description

      When in a high throughput scenario a TCP connection is closed remotely before one of the `await` methods on WriteFuture is invoked, the `await` method hangs.

      We've actually been seeing this issue for quite some time in our production environment in an application that uses Apache Camel with the camel-mina2 and camel-hl7 components. I was finally able to reproduce the issue in a unit test (see attached).

      1. DIRMINA1041Test.java
        3 kB
        Gijsbert van den Brink

        Activity

        Hide
        chrjohn Christoph John added a comment -

        I faced this problem with 2.0.15 in the QuickFIX/J test suite. However, it occurred very rarely. Will check with 2.0.16.

        Show
        chrjohn Christoph John added a comment - I faced this problem with 2.0.15 in the QuickFIX/J test suite. However, it occurred very rarely. Will check with 2.0.16.
        Hide
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment -

        Nice, great work!
        I will try to update to 2.0.16. Although I haven't had any problems since the workaround you did in 2.0.15, so I'm not sure if I will notice anything.

        Show
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment - Nice, great work! I will try to update to 2.0.16. Although I haven't had any problems since the workaround you did in 2.0.15, so I'm not sure if I will notice anything.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        I believe this commit (http://git-wip-us.apache.org/repos/asf/mina/commit/b5d0fbe3) fixes the issue.

        Please give it a try.

        Show
        elecharny Emmanuel Lecharny added a comment - I believe this commit ( http://git-wip-us.apache.org/repos/asf/mina/commit/b5d0fbe3 ) fixes the issue. Please give it a try.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        I think I have found a potential cause for the hang up : the DefaultWriteRequestQueue inner class in DefaultIoSessionDataStructureFactory, which was using a ConcurrentLinkedQueue instance, is also declaring some of its method synchronized, which does not make sense.

        Show
        elecharny Emmanuel Lecharny added a comment - I think I have found a potential cause for the hang up : the DefaultWriteRequestQueue inner class in DefaultIoSessionDataStructureFactory , which was using a ConcurrentLinkedQueue instance, is also declaring some of its method synchronized , which does not make sense.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Indeed. I haven't closed the issue because I haven't fixed it, I just workarounded it to be able to cut the release.

        What seems to happen is that when a session is closed on BOTH side, sometime one side does not terminate the session properly. So there is a race condition, but I wasn't able to find out the root cause (important side projects to deal with).

        I will continue to dig, now that I have somehow a clue about what could be the problem.

        Show
        elecharny Emmanuel Lecharny added a comment - Indeed. I haven't closed the issue because I haven't fixed it, I just workarounded it to be able to cut the release. What seems to happen is that when a session is closed on BOTH side, sometime one side does not terminate the session properly. So there is a race condition, but I wasn't able to find out the root cause (important side projects to deal with). I will continue to dig, now that I have somehow a clue about what could be the problem.
        Hide
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment -

        This is a different problem though right? If I run the test now it passes consistently, the original problem with the race condition seems to be fixed.

        Show
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment - This is a different problem though right? If I run the test now it passes consistently, the original problem with the race condition seems to be fixed.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Ok, it's defintively linked with the number of ephemeral ports available on my system. Here are the default values :

        $ sysctl -a | fgrep net.inet.ip.portrange
        net.inet.ip.portrange.lowfirst: 1023
        net.inet.ip.portrange.lowlast: 600
        net.inet.ip.portrange.first: 49152
        net.inet.ip.portrange.last: 65535
        net.inet.ip.portrange.hifirst: 49152
        net.inet.ip.portrange.hilast: 65535
        

        Changing the first and hifirst to 32768, I'm able to run the test up to 32 000 loops.

        Now, that means we aren't releasing the sockets fast enough, because adding a 2 ms delay makes the program happy. netstat does not show any used socket in TIME_WAIT state though.

        Show
        elecharny Emmanuel Lecharny added a comment - Ok, it's defintively linked with the number of ephemeral ports available on my system. Here are the default values : $ sysctl -a | fgrep net.inet.ip.portrange net.inet.ip.portrange.lowfirst: 1023 net.inet.ip.portrange.lowlast: 600 net.inet.ip.portrange.first: 49152 net.inet.ip.portrange.last: 65535 net.inet.ip.portrange.hifirst: 49152 net.inet.ip.portrange.hilast: 65535 Changing the first and hifirst to 32768 , I'm able to run the test up to 32 000 loops. Now, that means we aren't releasing the sockets fast enough, because adding a 2 ms delay makes the program happy. netstat does not show any used socket in TIME_WAIT state though.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        I have narrowed the problem to this stack trace :

        org.apache.mina.core.RuntimeIoException: Failed to get the session.
        	at org.apache.mina.core.future.DefaultConnectFuture.getSession(DefaultConnectFuture.java:68)
        	at org.apache.mina.transport.socket.nio.DIRMINA1041Test.testWrite(DIRMINA1041Test.java:57)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        	at java.lang.reflect.Method.invoke(Method.java:497)
        	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
        	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
        	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
        	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
        	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
        	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
        	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
        	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
        	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
        	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
        	at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86)
        	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192)
        Caused by: java.net.BindException: Can't assign requested address
        	at sun.nio.ch.Net.connect0(Native Method)
        	at sun.nio.ch.Net.connect(Net.java:454)
        	at sun.nio.ch.Net.connect(Net.java:446)
        	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
        	at org.apache.mina.transport.socket.nio.NioSocketConnector.connect(NioSocketConnector.java:185)
        	at org.apache.mina.transport.socket.nio.NioSocketConnector.connect(NioSocketConnector.java:1)
        	at org.apache.mina.core.polling.AbstractPollingIoConnector.connect0(AbstractPollingIoConnector.java:368)
        	at org.apache.mina.core.service.AbstractIoConnector.connect(AbstractIoConnector.java:268)
        	at org.apache.mina.core.service.AbstractIoConnector.connect(AbstractIoConnector.java:187)
        	at org.apache.mina.transport.socket.nio.DIRMINA1041Test.testWrite(DIRMINA1041Test.java:50)
        	... 25 more
        

        after 16300 loops. I don't get such a trace when I add a 2 ms sleep in the loop (at least, not always). Using a Thread.sleep(1) in the loop make the test to fail.

        This 16300 number is pretty magic... Sounds like an OS limit.

        Show
        elecharny Emmanuel Lecharny added a comment - I have narrowed the problem to this stack trace : org.apache.mina.core.RuntimeIoException: Failed to get the session. at org.apache.mina.core.future.DefaultConnectFuture.getSession(DefaultConnectFuture.java:68) at org.apache.mina.transport.socket.nio.DIRMINA1041Test.testWrite(DIRMINA1041Test.java:57) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:497) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229) at org.junit.runners.ParentRunner.run(ParentRunner.java:309) at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:86) at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:459) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:675) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:382) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:192) Caused by: java.net.BindException: Can't assign requested address at sun.nio.ch.Net.connect0(Native Method) at sun.nio.ch.Net.connect(Net.java:454) at sun.nio.ch.Net.connect(Net.java:446) at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648) at org.apache.mina.transport.socket.nio.NioSocketConnector.connect(NioSocketConnector.java:185) at org.apache.mina.transport.socket.nio.NioSocketConnector.connect(NioSocketConnector.java:1) at org.apache.mina.core.polling.AbstractPollingIoConnector.connect0(AbstractPollingIoConnector.java:368) at org.apache.mina.core.service.AbstractIoConnector.connect(AbstractIoConnector.java:268) at org.apache.mina.core.service.AbstractIoConnector.connect(AbstractIoConnector.java:187) at org.apache.mina.transport.socket.nio.DIRMINA1041Test.testWrite(DIRMINA1041Test.java:50) ... 25 more after 16300 loops. I don't get such a trace when I add a 2 ms sleep in the loop (at least, not always). Using a Thread.sleep(1) in the loop make the test to fail. This 16300 number is pretty magic... Sounds like an OS limit.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        If I introduce a delay in the loop (7 ms for instance), I get the test passing. There is definitively something like a race condition that occurs somewhere :/

        Show
        elecharny Emmanuel Lecharny added a comment - If I introduce a delay in the loop (7 ms for instance), I get the test passing. There is definitively something like a race condition that occurs somewhere :/
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Ok, you are 2 OSX version before me. I will have to test that on my wife's computer (I'm a old timer, "don't fix it when it works" kind of person, so I don't really like switching from one version to the newer, my wife is a different person, Microsoft forced her to migrate...)

        Show
        elecharny Emmanuel Lecharny added a comment - Ok, you are 2 OSX version before me. I will have to test that on my wife's computer (I'm a old timer, "don't fix it when it works" kind of person, so I don't really like switching from one version to the newer, my wife is a different person, Microsoft forced her to migrate...)
        Hide
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment -

        OSX version 10.11.6, Java version 1.8.0b102. I use IntelliJ to run the test.

        Show
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment - OSX version 10.11.6, Java version 1.8.0b102. I use IntelliJ to run the test.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Also note that I'm running the test with the last changes, so that may impact the results...

        Show
        elecharny Emmanuel Lecharny added a comment - Also note that I'm running the test with the last changes, so that may impact the results...
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Can you tell me more about your env ? (OSX version, Java version)

        Show
        elecharny Emmanuel Lecharny added a comment - Can you tell me more about your env ? (OSX version, Java version)
        Hide
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment -

        I'm on Mac OSX and for me the test almost always fails within 1000 attempts. Sometimes after only 70, sometimes after 900.
        I'm don't really know how to debug this, but let me know if I can help with testing or debugging this issue.

        Show
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment - I'm on Mac OSX and for me the test almost always fails within 1000 attempts. Sometimes after only 70, sometimes after 900. I'm don't really know how to debug this, but let me know if I can help with testing or debugging this issue.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Now, that is weird. I can have the test failing constantly on Mac OSX, but for that, I have to run the loop 100 000 times. And strange enough, it blocks around 16350 (more or less). On Linux, it seems to block around 1000 loops. At this point, I suspect we are reaching a limit that is OS dependent. A netstat -a | wc -l just gives a number around 500, so we are not facing a ulimit problem.

        Diging...

        Show
        elecharny Emmanuel Lecharny added a comment - Now, that is weird. I can have the test failing constantly on Mac OSX, but for that, I have to run the loop 100 000 times. And strange enough, it blocks around 16350 (more or less). On Linux, it seems to block around 1000 loops. At this point, I suspect we are reaching a limit that is OS dependent. A netstat -a | wc -l just gives a number around 500, so we are not facing a ulimit problem. Diging...
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Still fails on all OS

        Show
        elecharny Emmanuel Lecharny added a comment - Still fails on all OS
        Show
        elecharny Emmanuel Lecharny added a comment - Ah, sadly, not every time. So there is clearly a race condition somewhere. https://builds.apache.org/job/MINA-2.0.X-jdk1.7-windows/org.apache.mina$mina-core/75/testReport/junit/org.apache.mina.transport.socket.nio/DIRMINA1041Test/testWrite/
        Hide
        elecharny Emmanuel Lecharny added a comment -

        The test passes on windows and on Mac OSX, fails on linux.

        Show
        elecharny Emmanuel Lecharny added a comment - The test passes on windows and on Mac OSX, fails on linux.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Crap... This is not enough :/ The test is still failing : https://builds.apache.org/job/MINA-2.0.X-jdk1.7-ubuntu/333/org.apache.mina$mina-core/

        Show
        elecharny Emmanuel Lecharny added a comment - Crap... This is not enough :/ The test is still failing : https://builds.apache.org/job/MINA-2.0.X-jdk1.7-ubuntu/333/org.apache.mina$mina-core/
        Hide
        elecharny Emmanuel Lecharny added a comment -

        I have added the code that tells the future the session has been closed.

        Show
        elecharny Emmanuel Lecharny added a comment - I have added the code that tells the future the session has been closed.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Ok, it seems that if the connection is closed remotely, the session's writing queue is not emptied, thus the future is not terminated properly, so it ends in a timeout (60s).

        Show
        elecharny Emmanuel Lecharny added a comment - Ok, it seems that if the connection is closed remotely, the session's writing queue is not emptied, thus the future is not terminated properly, so it ends in a timeout (60s).
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Seems like Jenkins does not like the test :
        https://builds.apache.org/job/MINA-2.0.X-jdk1.7-ubuntu/org.apache.mina$mina-core/332/console

        Ok, investigatio on its way.

        Show
        elecharny Emmanuel Lecharny added a comment - Seems like Jenkins does not like the test : https://builds.apache.org/job/MINA-2.0.X-jdk1.7-ubuntu/org.apache.mina$mina-core/332/console Ok, investigatio on its way.
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Hmmm.

        I just ran the test again, and it works. The test has been added in the 2.0 branch : http://git-wip-us.apache.org/repos/asf/mina/commit/27204128

        (in Branch: refs/heads/2.0)

        Show
        elecharny Emmanuel Lecharny added a comment - Hmmm. I just ran the test again, and it works. The test has been added in the 2.0 branch : http://git-wip-us.apache.org/repos/asf/mina/commit/27204128 (in Branch: refs/heads/2.0)
        Hide
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment -

        Thanks!

        However, when I check out the 2.0 branch the test it still fails. And I don't see the test in trunk. Am I missing something?

        Show
        vandenbrink@zorgdomein.nl Gijsbert van den Brink added a comment - Thanks! However, when I check out the 2.0 branch the test it still fails. And I don't see the test in trunk. Am I missing something?
        Hide
        elecharny Emmanuel Lecharny added a comment -

        It seems the test is passing in trunk, since we fixed the problem related to the session remaining in a close state forever in some specific cases.

        I have added the test to the code base. Many thanks for it !

        Show
        elecharny Emmanuel Lecharny added a comment - It seems the test is passing in trunk, since we fixed the problem related to the session remaining in a close state forever in some specific cases. I have added the test to the code base. Many thanks for it !

          People

          • Assignee:
            Unassigned
            Reporter:
            vandenbrink@zorgdomein.nl Gijsbert van den Brink
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development