DIRMINA-762

WARN org.apache.mina.core.service.IoProcessor - Create a new selector. Selected is 0, delta = 0

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels: None
    • Environment: Linux (2.6.26-2-amd64), Java 1.6.0_12 and 1.6.0_18.

      Description

      The MINA server gets into a bad state where it constantly prints:
      WARN org.apache.mina.core.service.IoProcessor - Create a new selector. Selected is 0, delta = 0

      When this happens, server throughput drops significantly.

      To reproduce, run the attached server and client for a short while (about 30 seconds on my box).
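
      For context, this warning comes from MINA's workaround for the JDK epoll bug, in which select() returns 0 immediately in a tight loop; when the spin is detected, the processor logs this warning and swaps in a fresh selector. Below is a simplified sketch of that kind of workaround (illustrative only; MINA's actual AbstractPollingIoProcessor logic differs, and the class name and 100 ms threshold here are assumptions):

        import java.io.IOException;
        import java.nio.channels.SelectionKey;
        import java.nio.channels.Selector;

        // Illustrative sketch only, not MINA's real code. On some JDK/epoll
        // combinations select() can return 0 immediately in a tight loop;
        // the usual defense is to detect the spin, recreate the selector,
        // and re-register every channel on the fresh one.
        public class SpinGuard {
            private Selector selector;

            public SpinGuard(Selector selector) {
                this.selector = selector;
            }

            void selectLoop() throws IOException {
                while (selector.isOpen()) {
                    long before = System.currentTimeMillis();
                    int selected = selector.select(1000);   // wait up to 1 s for events
                    long delta = System.currentTimeMillis() - before;

                    if (selected == 0 && delta < 100) {
                        // select() returned immediately with nothing ready:
                        // probable epoll spin. Replace the broken selector,
                        // logging the warning seen in this ticket.
                        System.err.println("Create a new selector. Selected is 0, delta = " + delta);
                        Selector fresh = Selector.open();
                        for (SelectionKey key : selector.keys()) {
                            key.channel().register(fresh, key.interestOps(), key.attachment());
                        }
                        selector.close();
                        selector = fresh;
                        continue;
                    }
                    // ... process selector.selectedKeys() here ...
                    selector.selectedKeys().clear();
                }
            }
        }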

      Attachments

      1. BufferCodec.java
        2 kB
        Omry Yadan
      2. NettyTestServer.java
        5 kB
        Omry Yadan
      3. RateCounter.java
        2 kB
        Omry Yadan
      4. Screen shot 2010-02-02 at 7.48.39 PM.png
        354 kB
        Emmanuel Lecharny
      5. Screen shot 2010-02-02 at 7.48.46 PM.png
        351 kB
        Emmanuel Lecharny
      6. Screen shot 2010-02-02 at 7.48.59 PM.png
        386 kB
        Emmanuel Lecharny
      7. Screen shot 2010-02-02 at 7.49.13 PM.png
        346 kB
        Emmanuel Lecharny
      8. Screen shot 2010-02-02 at 7.49.18 PM.png
        337 kB
        Emmanuel Lecharny
      9. Server.java
        2 kB
        Omry Yadan
      10. StressClient.java
        5 kB
        Omry Yadan


          Activity

          Emmanuel Lecharny added a comment -

          Here is a snapshot of the profiled session (64 threads bombarding the server with small messages every 10 ms).

          That works out to 6,400 messages per second.

          Emmanuel Lecharny added a comment -

          Here are five snapshots:
          3 for the server
          2 for the client

          They show:

          • Memory + CPU over time
          • Threads at the end of the test (green = running)
          • Allocated objects on the server

          One strange thing: after around 350,000 messages have been sent, the client and the server suddenly slow down. CPU usage, which peaked at 85% on my system, drops to 30%, and this is visible on both graphs.

          We also see a huge number of ConcurrentLinkedQueue$Node objects, which get garbage collected later.

          My guess is that at some point we are simply pushing too many messages into the queues.

          Omry Yadan added a comment -

          Improved stress client:
          1. Added rate printing (number of messages received in the last 5 seconds).
          2. Added monitoring threads: if a thread stalls for more than MAX_IDLE_WARNING (1000 ms), a warning is printed (sketched below).
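
          For reference, a minimal sketch of what such a stall watchdog might look like (hypothetical names throughout; the actual StressClient.java attachment may differ):

            import java.util.concurrent.atomic.AtomicLongArray;

            // Hypothetical sketch of the stall watchdog described above, not the
            // actual StressClient code: workers record a timestamp on every message,
            // and a monitor thread warns when a worker has been idle too long.
            class StallWatchdog {
                static final int MAX_IDLE_WARNING = 1000; // ms before a stall warning

                private final AtomicLongArray lastActivity;

                StallWatchdog(int workers) {
                    lastActivity = new AtomicLongArray(workers);
                    long now = System.currentTimeMillis();
                    for (int i = 0; i < workers; i++) lastActivity.set(i, now);
                }

                // Worker threads call this each time they receive a message.
                void touch(int worker) {
                    lastActivity.set(worker, System.currentTimeMillis());
                }

                void start() {
                    Thread monitor = new Thread(() -> {
                        while (!Thread.currentThread().isInterrupted()) {
                            long now = System.currentTimeMillis();
                            for (int i = 0; i < lastActivity.length(); i++) {
                                long idle = now - lastActivity.get(i);
                                if (idle > MAX_IDLE_WARNING) {
                                    System.out.println("Warning : Long connection thread "
                                            + i + " have no activity for " + idle + " ms");
                                }
                            }
                            try { Thread.sleep(MAX_IDLE_WARNING); }
                            catch (InterruptedException e) { return; }
                        }
                    }, "stall-monitor");
                    monitor.setDaemon(true);
                    monitor.start();
                }
            }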

          Omry Yadan added a comment -

          Looks like Netty 3.1.5.GA (the latest stable release) does not suffer from the same problems.
          The attached Netty test server does exactly the same thing as the MINA test server in this ticket, but does not suffer from the throughput drop.

          The Netty server also performs great when I use 64 persistent long connections without ANY delay and 0 short (open-and-close) connections:
          against that client I get somewhere between 200k and 300k messages/sec on my box, while MINA performs poorly with the same client, getting only 20 messages/sec.

          Emmanuel Lecharny added a comment -

          That's interesting.

          I doubt that you obtain 200k msg/s when doing (connect / send message / receive response / disconnect), though. In a scenario where you establish a connection, send N messages, then disconnect, yes, I agree, you will see a far higher number of messages exchanged.

          I think that is the big difference here.

          Omry Yadan added a comment -

          I compared apples to apples:
          the scenario with 64 persistent connections firing messages without any delay was run against both servers.

          The problem with the open-send-receive-close test is that it's very easy to hit that pesky "java.net.NoRouteToHostException: Cannot assign requested address".
          I tweaked my Linux TCP stack a bit to better handle the open-send-receive-close use case (reusing TIME_WAIT sockets and the like).
          The test used 3 threads with a delay of 5 ms between messages.

          Both servers had similar throughput (~560/sec for Netty, ~540/sec for MINA).
          This is nearly optimal: the maximum number of messages for this setup would be (1000 / 5) × 3 = 600 per second.
          This indicates I could have gotten higher numbers had I worked around the local-ports issue, so in any case it's not a very good test.

          Emmanuel Lecharny added a comment -

          Yeah, I did some tuning on my laptop too (setting the delay to 1 s instead of the 15 s default) in order to be able to launch more clients.

          Do you have modified client and server tests without all those connections and disconnections?

          I have many servers available (on Linux) right now, and I can run a test (5 injectors, one server).

          Emmanuel Lecharny added a comment -

          OK, I found some time to run the new client, and I got some interesting results:

          3092.25 messages/sec (total messages received 15235)
          3231.0 messages/sec (total messages received 31367)
          ...
          3232.25 messages/sec (total messages received 79864)
          Warning : Short connections thread 2 have no activity for 1322 ms
          Warning : Long connection thread 0 have no activity for 1271 ms
          Warning : Long connection thread 1 have no activity for 1262 ms
          Warning : Long connection thread 3 have no activity for 1271 ms
          Warning : Long connection thread 6 have no activity for 1264 ms
          Warning : Long connection thread 10 have no activity for 1269 ms
          Warning : Long connection thread 17 have no activity for 1262 ms
          Warning : Long connection thread 18 have no activity for 1261 ms
          Warning : Long connection thread 21 have no activity for 1271 ms
          Warning : Long connection thread 24 have no activity for 1267 ms
          Warning : Long connection thread 27 have no activity for 1271 ms
          Warning : Long connection thread 28 have no activity for 1261 ms
          Warning : Long connection thread 31 have no activity for 1262 ms
          2262.5 messages/sec (total messages received 91947)
          1733.0 messages/sec (total messages received 100753)
          Warning : Short connections thread 1 have no activity for 1479 ms
          Warning : Long connection thread 4 have no activity for 1421 ms
          Warning : Long connection thread 7 have no activity for 1430 ms
          Warning : Long connection thread 11 have no activity for 1422 ms
          Warning : Long connection thread 13 have no activity for 1424 ms
          Warning : Long connection thread 14 have no activity for 1430 ms
          Warning : Long connection thread 16 have no activity for 1422 ms
          Warning : Long connection thread 19 have no activity for 1424 ms
          Warning : Long connection thread 22 have no activity for 1423 ms
          Warning : Long connection thread 26 have no activity for 1430 ms
          Warning : Long connection thread 29 have no activity for 1421 ms
          1041.25 messages/sec (total messages received 105948)
          ...
          1035.5 messages/sec (total messages received 163086)
          Warning : Short connections thread 4 have no activity for 1844 ms
          Warning : Long connection thread 2 have no activity for 1793 ms
          Warning : Long connection thread 5 have no activity for 1786 ms
          Warning : Long connection thread 8 have no activity for 1785 ms
          Warning : Long connection thread 9 have no activity for 1785 ms
          Warning : Long connection thread 12 have no activity for 1793 ms
          Warning : Long connection thread 15 have no activity for 1783 ms
          Warning : Long connection thread 20 have no activity for 1793 ms
          Warning : Long connection thread 23 have no activity for 1784 ms
          Warning : Long connection thread 25 have no activity for 1786 ms
          Warning : Long connection thread 30 have no activity for 1785 ms
          39.5 messages/sec (total messages received 163483)
          ...

          There are 32 long connections created; roughly one third of them stall, then a few moments later another third stalls, and at the end the last third stalls. At that point, we don't receive any messages on the long connections.

          It seems we have a problem writing data to the client at some point. The selector is not swapped, by the way.

          Emmanuel Lecharny added a comment -

          Tested with 2.0.0-RC1, no problem so far. Definitely a bug introduced in one of the 47 commits since then...

          I'm currently reverting the commits one by one until the test runs correctly, to find which one introduced the problem. It takes a hell of a lot of time...

          Emmanuel Lecharny added a comment -

          I think this commit broke the server:
          http://svn.apache.org/viewvc?view=revision&revision=888842

          I will check what's inside (I wrote it, so I guess I'm responsible).

          Serge Baranov added a comment -

          Emmanuel, that's exactly what I found yesterday; you should have started from that commit right away. See http://issues.apache.org/jira/browse/DIRMINA-678?focusedCommentId=12831297&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12831297

          Emmanuel Lecharny added a comment -

          sh*t... I could have saved myself a hell of a lot of time...

          Yeah, Serge, this is THE commit. I'm checking what's inside. It's a pretty long one, though...

          At least we have a suspect now; time to use appropriate force on it to make it talk.

          Emmanuel Lecharny added a comment -

          Having stripped out everything irrelevant in this giant patch, I see some interesting parts:

          • A new atomic flag (wakeupCalled) has been added; I don't remember why.
          • It is used to protect the part of the code where we try to detect that epoll is spinning.
          • I'm wondering whether we have a race condition here and stay locked...

          A quick thread dump could help...

          Emmanuel Lecharny added a comment -

          eh, eh, eh, I may have a fix!

          The wakeupCalled.getAndSet(false) call is not invoked before we do a "continue", so the flag keeps the same value ad vitam aeternam. Wrong...

          I have the sample running for more than 1 million messages, still at the same rate. Sounds good!
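
          In schematic form, the bug and the fix look something like the following (an illustrative reconstruction, not the actual MINA source; the SELECT_TIMEOUT value, the nothingToDo() condition, and the surrounding class are placeholders):

            import java.nio.channels.Selector;
            import java.util.concurrent.atomic.AtomicBoolean;

            // Illustrative reconstruction of the bug and fix, not the exact MINA
            // source. wakeupCalled records that someone explicitly called wakeup(),
            // so an immediate return from select() is not mistaken for epoll spin.
            class ProcessorLoop {
                private static final long SELECT_TIMEOUT = 1000; // ms, assumed value
                private final AtomicBoolean wakeupCalled = new AtomicBoolean(false);

                void wakeup(Selector selector) {
                    wakeupCalled.set(true);
                    selector.wakeup();
                }

                void run(Selector selector) throws Exception {
                    while (true) {
                        int selected = selector.select(SELECT_TIMEOUT);

                        if (selected == 0 && !wakeupCalled.get()) {
                            // Nothing selected and no explicit wakeup: possible
                            // epoll spin; check and maybe replace the selector here.
                        }

                        if (nothingToDo()) {
                            // BUG: the original code hit 'continue' here WITHOUT the
                            // reset below, so once wakeupCalled became true it stayed
                            // true forever and spin detection was disabled for good.
                            wakeupCalled.getAndSet(false); // the fix: reset before continuing
                            continue;
                        }

                        // ... process selector.selectedKeys(), flush sessions, etc. ...
                        wakeupCalled.getAndSet(false); // normal end-of-iteration reset
                    }
                }

                private boolean nothingToDo() { return false; } // placeholder condition
            }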

          Omry Yadan added a comment -

          sounds good

          Emmanuel Lecharny added a comment -

          Seems fixed with: http://svn.apache.org/viewvc?rev=908669&view=rev
          Omry Yadan added a comment -

          The problem I reported appears to be fixed, but this benchmark exposes another performance issue.
          Configure it with 32 long connections, 0 short-connection threads, and 0 delay between long-connection messages:

          // Number of long lived connections
          static final int LONG_CONNECTIONS = 32;

          // Number of threads that spawns short lived connections
          static final int SHORT_CONNECTIONS_THREADS = 0;

          // number of ms to sleep before sending a message to the server in the long connection threads
          static final int LONG_CONNECITON_THREAD_MESSAGE_DELAY = 0;

          Throughput is very low with this setup (about 10 messages/sec).

          With the attached Netty test server I reached 200k-300k messages/sec with the same setup.

          Omry Yadan added a comment -

          Updated the stress client so it no longer prints false thread-idle warnings at the start of the test.

          Emmanuel Lecharny added a comment -

          I moved the issue to https://issues.apache.org/jira/browse/DIRMINA-764
          Emmanuel Lecharny added a comment -

          Linked the issue to the newly created one.

          Emmanuel Lecharny added a comment -

          Closing issues


            People

            • Assignee: Unassigned
            • Reporter: Omry Yadan
            • Votes: 0
            • Watchers: 3
