  MINA / DIRMINA-764

DDOS possible in only a few seconds...

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0-RC1
    • Fix Version/s: 2.0.8
    • Component/s: None
    • Labels: None

      Description

      We can kill a server in just a few seconds using the stress test found in DIRMINA-762.

      If we inject messages with no delay, using 50 threads, the ProtocolCodecFilter$MessageWriteRequest queue gets stuffed with hundreds of thousands of messages waiting to be written back to the client, with no success.

      On the client side, we receive almost no messages:
      0 messages/sec (total messages received 1)
      2 messages/sec (total messages received 11)
      8 messages/sec (total messages received 55)
      8 messages/sec (total messages received 95)
      9 messages/sec (total messages received 144)
      3 messages/sec (total messages received 162)
      1 messages/sec (total messages received 169)
      ...

      On the server side, the memory is totally swamped in 20 seconds, with no way to recover:
      Exception in thread "pool-1-thread-1" java.lang.OutOfMemoryError: Java heap space

      (see graph attached)

      On the server, the ConcurrentLinkedQueue contains the messages to be written (in my case, 724,499 Node instances are present). There are also 361,629 DefaultWriteRequests, 361,628 DefaultWriteFutures, 361,625 SimpleBuffers, 361,618 ProtocolCodecFilter$MessageWriteRequests and 361,614 ProtocolCodecFilter$EncodedWriteRequests.

      That means we are not flushing them to the client at all.

      1. screenshot-1.jpg
        126 kB
        Emmanuel Lecharny
      2. screenshot-2.jpg
        185 kB
        Emmanuel Lecharny


          Activity

          elecharny Emmanuel Lecharny added a comment -

          CPU usage while running the test: 100%, with a lot of system CPU.

          elecharny Emmanuel Lecharny added a comment -

          The memory consumption. All the memory is eaten in 20 seconds.

          vicnov Victor N added a comment -

          Emmanuel, are your clients in this test fast enough to read at the speed the server is writing? Also, is the network between the server and the client fast enough?
          Maybe the "read buffer" is too small in the client? I do not see it configured in the stress client.
          I would say this is typical - when a server writes too quickly into a socket and the client cannot read at that speed, the server will die with an OutOfMemory.
          You need to throttle/limit the write speed somehow. As far as I know, in MINA the writeRequestQueue in IoSession is unbounded.

          elecharny Emmanuel Lecharny added a comment -

          The read is blocking, so I guess it reads as soon as something returns...

          The network is fast enough, hopefully, as I ran the test locally!

          Also, the messages are 9 bytes long. No need for extra-large buffers here :/

          I think there is a huge problem in the way the server handles the channel being ready for write: it seems to send just one single message. I have to check that, though.

          vicnov Victor N added a comment -

          I am not 100% sure, but IMHO when you run the stress clients and the server on the same host, the CPU and I/O activity are so high that the test results may be unreliable.
          I would propose running the same test in a LAN environment - all clients on a separate machine, or even on multiple machines.

          As for TCP buffers, they do not depend on how you use your socket - via blocking or non-blocking I/O, locally or remotely. If your client works slowly (under high load on your computer), it will read slowly; in addition, if it has a small TCP buffer for reading, the whole TCP transmission stalls and the server will not send to the socket anymore (remember how the congestion control algorithm in TCP works?).

          Of course, maybe this is not the case in your test, so it would be useful to compare with another MINA build before you start digging into the code.

          elecharny Emmanuel Lecharny added a comment -

          OK, there is a slight problem in the client: we don't wait for the response, we immediately send another message. The server does not have the time to send the response, as it is pounded with new requests.

          I have slightly modified the client code to wait until some bytes are available, instead of immediately sending a new message.

          The server is now stable, dealing with around 12,000 messages per second. No OOM, but very high CPU consumption.

          Sadly, I can't tell if the system CPU is caused by the server at this point. I have to run the test on different machines.

          However, I am keeping this issue open, because a malevolent client can kill a MINA server in a matter of seconds. This has to be fixed.
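          For illustration, a minimal sketch of such a modified client loop, assuming a plain blocking socket and the 9-byte messages mentioned earlier; the host, port, message contents and iteration count are made up and are not taken from the DIRMINA-762 test:

          import java.io.DataInputStream;
          import java.io.OutputStream;
          import java.net.Socket;

          public class BlockingStressClient {
              public static void main(String[] args) throws Exception {
                  byte[] request = new byte[9];   // 9-byte message, contents irrelevant here
                  byte[] response = new byte[9];
                  try (Socket socket = new Socket("localhost", 8080)) {
                      OutputStream out = socket.getOutputStream();
                      DataInputStream in = new DataInputStream(socket.getInputStream());
                      for (int i = 0; i < 100000; i++) {
                          out.write(request);
                          out.flush();
                          in.readFully(response); // block until the full response arrives
                      }                           // before sending the next request
                  }
              }
          }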

          elecharny Emmanuel Lecharny added a comment -

          Victor, you are perfectly right.

          My intention is to build a test environment, as I have a 4-way CPU box with 16 GB of RAM, 5 injectors, and a gigabit Ethernet network. On my local machine, I'm most certainly bound by the clients, which are eating 2/3 of the CPU.

          Right now, I'm just worrying about the server crash I get.

          omry Omry Yadan added a comment -

          Looks like a minor client bug indeed, which would manifest itself if the server is slow.
          I don't think running on the same machine is really the issue here: when I run the same stress client against a Netty test server which does exactly the same thing (also attached to 762), I get a throughput of 200k-300k messages/sec.

          vicnov Victor N added a comment -

          >> a malevolent client can kill a MINA server in a matter of seconds. This has to be fixed. <<

          In fact, this is not a MINA-specific problem; it is common in the networking world. But I agree, we should propose some solutions, e.g.:

          1) writeRequestQueue could be bounded - somewhere we could configure its size and a policy for what to do when the queue is full (like in Executors)
          2) some kind of write throttling (optional) - as I remember, MINA already has an IoEventQueueThrottle class, but I never used it and I do not know if it is up to date

          If some client (an IoSession) is slow, that is, there are many events waiting for a socket write, it is the server application's responsibility to decide what to do - ignore new events, send some kind of warning to the client ("hey mister, your network is too slow, you risk being disconnected!"), maybe even disconnect the client after some time, etc. If the client and the server can negotiate in this situation, everything will work well. We did something like this for Flash clients using the Red5 server (based on MINA) - we checked the writeRequestQueue (or calculated the number of pending write requests, maybe) and tuned the frame rate of the video stream; sometimes we sent a warning to the client.

          Of course, there may be "bad clients" trying to do a DDoS - this way we can also handle such situations.
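          As a rough illustration of the "check the write queue and drop slow clients" idea, here is a minimal sketch using MINA 2.x filter APIs; the threshold and the decision to close the session immediately are assumed policy choices, not something MINA provides out of the box:

          import org.apache.mina.core.filterchain.IoFilterAdapter;
          import org.apache.mina.core.session.IoSession;
          import org.apache.mina.core.write.WriteRequest;

          public class SlowClientGuardFilter extends IoFilterAdapter {
              private static final int MAX_PENDING_WRITES = 10000; // assumed per-session limit

              @Override
              public void filterWrite(NextFilter nextFilter, IoSession session, WriteRequest writeRequest)
                      throws Exception {
                  if (session.getScheduledWriteMessages() > MAX_PENDING_WRITES) {
                      // The peer is not reading fast enough; close it instead of queueing forever.
                      session.close(true); // immediate close (MINA 2.0.x API)
                      return;
                  }
                  nextFilter.filterWrite(session, writeRequest);
              }
          }

          The same check could equally be done in the IoHandler before calling session.write(); the filter form just keeps the policy out of application code.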

          elecharny Emmanuel Lecharny added a comment -

          Netty deals with messages in a completely different way: it has 2 chains, one for incoming messages and one for outgoing messages (something MINA should have had since day one...). It allows for a much better throughput.

          I haven't read Netty's code, but I also suspect that no copy is done, and that it does not use queues to transfer messages from one filter to another. That could help a lot.

          elecharny Emmanuel Lecharny added a comment -

          I did some tests with Netty and I don't get anywhere near the 200 Kmsg/s you get.

          Running the exact same client, with a 2 ms delay waiting for incoming messages to arrive, and 100 threads, I top out at 15,000 msg/s, 3,000 more than MINA. Now, if I set the delay to 0, I top out at 6,000 msg/s, but the good news is that Netty does not stall.

          It seems that Netty does have a way to throttle the throughput, something we have to implement in MINA.

          vicnov Victor N added a comment -

          I found this on Netty's documentation page:

          1. No more OutOfMemoryError due to fast, slow or overloaded connection.
          2. No more unfair read / write ratio often found in a NIO application under high speed network

          This is what we should implement in MINA 2.0 - protect ourselves from clients writing too quickly or reading too slowly.
          Emmanuel, it seems that the "unfair read / write ratio" is what you have seen in your test!

          elecharny Emmanuel Lecharny added a comment -

          As I suspected, the way MINA handles writes to the client is buggy.

          Messages stored in the WriteQueue waiting to be sent are processed one by one, with a loop back through the select() call between each of them. The queue will be emptied very slowly if clients never stop injecting new messages asynchronously...

          It has to be understood that for every message to be written, an empty message is added to the queue in order to generate a messageSent event. Not optimal...

          elecharny Emmanuel Lecharny added a comment -

          I have fixed the bug: the request buffer was reset because we need it for the messageSent() event, but it was never set back to its initial position, leading to a double call (even if the message wasn't sent again).

          There is a net 5% performance gain when repositioning the buffer to its initial position. I go up to 13,900 msg/s against 13,200 before. We are still slower than Netty 3, but the margin is thinner (we were 18% slower before, and are now only 12% slower; Netty reaches 15,640 msg/s).

          Patch: http://svn.apache.org/viewvc?rev=910779&view=rev

          trustin Trustin Lee added a comment -

          What Netty version is being used for testing?

          elecharny Emmanuel Lecharny added a comment -

          The latest one. But take those tests with extra care: I run everything on my laptop, so the client (plain BIO) kills the CPU. Should the server run on its own box, the results would be very different.

          omry Omry Yadan added a comment -

          Checked with the latest from trunk, and the problem still exists (although performance is slightly better).

          elecharny Emmanuel Lecharny added a comment -

          We did some extra tests yesterday, running the same client against Netty 3, and we are also "successfully" killing it in a matter of seconds. The only difference is that it does not die from an OOM.

          The JVM and the OS might also be part of the problem (Mac OS X Snow Leopard with a crappy JVM).

          elecharny Emmanuel Lecharny added a comment -

          I now know why we process reads far more often than writes, and never have time to write data. The way the core code is written makes it likely that we call the read() method twice as often as the write() method.

          The way it 'works' is that for each read message that produces a response, we enqueue two messages to be flushed:

          • the real response
          • an empty marker, so that the messageSent() event can be processed correctly.

          Sadly, we keep the SelectionKey set to OP_WRITE until the second message is written, so we do two select() calls for each message to write. If the clients write to the server without waiting for the response to come back, then we enqueue a hell of a lot of messages we will never be able to send to the client. After a while, all the queues will be full of crap, and the JVM will die with an OOM.

          This is bad design from day one, with a pile of hacks added to try to make the code work. Ugly. Not sure this can be fixed easily, but I will try. Puke. Puke, puke, puke!

          elecharny Emmanuel Lecharny added a comment -

          One more fix: there was a useless loop in the writeRequest processing. I removed it with:

          http://svn.apache.org/viewvc?rev=911833&view=rev

          The gain is again around 3.5%, or 500 msg/s. NETTY 3 is only 8.7% faster than MINA now.

          Keep going ...

          vrm Julien Vermillard added a comment -

          I really wonder what the idea behind this 'spin' was?

          trustin Trustin Lee added a comment -

          Latest stable or latest unstable?

          trustin Trustin Lee added a comment -

          If you were using 3.2.0.ALPHA4, it has a known performance regression and freeze. Try this one instead:

          http://hudson.jboss.org/hudson/view/Netty/job/netty/3022/artifact/trunk/target/netty-3.2.0.ALPHA5-SNAPSHOT.jar

          If you were using 3.1.5.GA, then I would wonder about the death without OOME.

          elecharny Emmanuel Lecharny added a comment -

          I used the latest stable version, as the user did the comparison with a stable version, I guess. Anyway, I may have been wrong about Netty crashing, but I was able to reproduce it 3 or 4 times yesterday, with Julien as a witness. And as I said, it may well be a problem with the OS and JVM I'm using on my Mac.

          If the test can be used to demonstrate the problem with Netty, then fine, but I'm not really interested in whether Netty works or not with this test; I'm just using Netty because the user created a test for it.

          What is important to me is to get MINA working, and as I already said, the performance numbers I get are meaningful only within this very test, as I'm running the clients and the server on a single machine. The initial run gave me a baseline, and I'm using that baseline (i.e., 13,200 msg/s with MINA) to see whether the fixes I'm introducing in MINA have a real impact on the code.

          elecharny Emmanuel Lecharny added a comment -

          Postponed to 2.0.1

          elecharny Emmanuel Lecharny added a comment -

          This is really an issue that needs a fix, but I won't have time to deal with it for 2.0.2. Postponed to 2.0.3

          maxwindiff Max Ng added a comment -

          I am implementing a TCP server with 2.0.4 and have run into the same problem (a lot of reads, very few writes, an OutOfMemoryError eventually). I am running Java 1.6.0_29 on OS X Lion. Will this be fixed in 2.0.5?

          If I want to throttle reads manually, what is the recommended approach? I would like to suspend all reads when memory usage exceeds a threshold, forcing TCP flow control to kick in and stop the client from sending more data. Can IoEventQueueThrottle be used for this purpose? What does the threshold parameter in its constructor mean?
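          A hedged sketch of the manual read-throttling idea described in this comment, assuming MINA 2.x session APIs (suspendRead/resumeRead); the 80% heap threshold and doing the check in messageReceived are illustrative choices, not a documented MINA recipe:

          import org.apache.mina.core.service.IoHandlerAdapter;
          import org.apache.mina.core.session.IoSession;

          public class ReadThrottlingHandler extends IoHandlerAdapter {
              private static final double HEAP_USAGE_LIMIT = 0.80; // assumed: pause reads above 80% heap use

              @Override
              public void messageReceived(IoSession session, Object message) throws Exception {
                  Runtime rt = Runtime.getRuntime();
                  double used = (double) (rt.totalMemory() - rt.freeMemory()) / rt.maxMemory();
                  if (used > HEAP_USAGE_LIMIT) {
                      session.suspendRead();   // stop selecting OP_READ; TCP flow control pushes back on the client
                  }
                  session.write(message);      // echo-style processing, as in the stress test
              }

              @Override
              public void messageSent(IoSession session, Object message) throws Exception {
                  if (session.isReadSuspended() && session.getScheduledWriteMessages() == 0) {
                      session.resumeRead();    // pending writes have drained, accept input again
                  }
              }
          }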

          b.eckenfels Bernd Eckenfels added a comment -

          There are two things here. In normal operation, where the client is actually receiving data, reads and writes should be balanced. However, the danger of a DoS by overloading the write buffer (by a client which does not read but is able to trigger more answers) is a question of application design. Besides a capacity-limited write queue, the future/sent callbacks can be used to monitor outstanding writes. And another note: the write queue objects are much too heavy in terms of overhead. It would be good if a simple application could enqueue only IoBuffers, and maybe have a simple integer sequence for tracking/callbacks/futures.
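          A small sketch of the "use the write callbacks to monitor outstanding writes" suggestion, assuming plain MINA 2.x APIs; the counter, the 5,000-message cap, and the helper class itself are illustrative, not part of MINA:

          import java.util.concurrent.atomic.AtomicInteger;

          import org.apache.mina.core.future.IoFutureListener;
          import org.apache.mina.core.future.WriteFuture;
          import org.apache.mina.core.session.IoSession;

          public final class OutstandingWriteTracker {
              private static final int MAX_OUTSTANDING = 5000; // assumed application limit
              private final AtomicInteger outstanding = new AtomicInteger();

              /** Writes the message unless too many writes are still pending; returns false if it refused. */
              public boolean writeGuarded(IoSession session, Object message) {
                  if (outstanding.get() >= MAX_OUTSTANDING) {
                      return false; // caller decides what to do: drop, warn the client, or disconnect
                  }
                  outstanding.incrementAndGet();
                  WriteFuture future = session.write(message);
                  future.addListener(new IoFutureListener<WriteFuture>() {
                      @Override
                      public void operationComplete(WriteFuture f) {
                          outstanding.decrementAndGet(); // the write completed (or failed); no longer pending
                      }
                  });
                  return true;
              }
          }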

          elecharny Emmanuel Lecharny added a comment - edited

          Four years on, this is not so much a MINA problem as an application implementation issue. If the application is not able to process the incoming messages fast enough, then either a firewall should be installed (to protect against malevolent clients) or the application has to be redesigned.

          If the client cannot read the messages fast enough, the application should wait for the current message to be sent before writing the next one.
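          A minimal sketch of that "wait for the current message to be sent before writing the next one" advice, using the WriteFuture returned by IoSession.write() in MINA 2.x; the 10-second timeout and the decision to close the session are assumptions for illustration:

          import org.apache.mina.core.future.WriteFuture;
          import org.apache.mina.core.session.IoSession;

          public final class SequentialWriter {
              public void writeAll(IoSession session, Iterable<Object> messages) {
                  for (Object message : messages) {
                      WriteFuture future = session.write(message);
                      // Block until this message has actually been flushed to the socket
                      // (or give up after 10 seconds) before queueing the next one.
                      if (!future.awaitUninterruptibly(10000)) {
                          session.close(true); // the peer is too slow; stop piling up writes
                          return;
                      }
                  }
              }
          }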

          elecharny Emmanuel Lecharny added a comment -

          A carefully designed server will not get hit by such a problem.


            People

            • Assignee: elecharny Emmanuel Lecharny
            • Reporter: elecharny Emmanuel Lecharny
            • Votes: 0
            • Watchers: 6
