Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Workaround
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.7.0

      Description

      I have an SSH server implementation based on this library. It sends gigabytes (e.g. 5 GB) of data as command output.

      Starting from PuTTY plink 0.68 (this also includes plink 0.69) we started to get OOM errors. Inspecting memory dumps showed that most of the memory is consumed by the function

      org.apache.sshd.common.session.AbstractSession#writePacket(org.apache.sshd.common.util.buffer.Buffer)

      In the hprof dump I see thousands of PendingWriteFuture objects (incidentally, each one holds a reference to a logger instance), and those objects are created only from this function.

      It is clear the session is going through a rekey; I can see the kexState indicating the progress.

      Is there a way to artificially limit the send queue, regardless of whether the remote window allows sending that enormous amount of data? By my estimation, the window was reported to be around 1.5 GB or more. Perhaps such a huge window size was caused by the arithmetic overflow that was fixed in SSHD-701.

        Issue Links

          Activity

          lgoldstein Goldstein Lyor added a comment -

          Added capability to register a ChannelStreamPacketWriterResolver through which one can wrap the channel inside one's own code that can do throttling. See also (experimental) ThrottlingPacketWriter example in sshd-contrib module.
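          The idea behind such a throttling wrapper can be sketched with a minimal, self-contained model (the `PacketSink` interface and all names below are illustrative, not the actual sshd-contrib API): cap the number of in-flight writes with a semaphore, blocking the producer once the limit is reached.

          ```java
          import java.util.concurrent.Semaphore;

          // Minimal model of the throttling-wrapper idea: at most maxPending
          // writes may be in flight at once; further writers block until a
          // completion callback frees a permit.
          interface PacketSink {
              void write(byte[] packet, Runnable onComplete);
          }

          class ThrottlingSink implements PacketSink {
              private final PacketSink delegate;
              private final Semaphore permits;

              ThrottlingSink(PacketSink delegate, int maxPending) {
                  this.delegate = delegate;
                  this.permits = new Semaphore(maxPending);
              }

              @Override
              public void write(byte[] packet, Runnable onComplete) {
                  permits.acquireUninterruptibly(); // blocks once maxPending writes are queued
                  delegate.write(packet, () -> {
                      permits.release(); // a completed write frees a slot
                      onComplete.run();
                  });
              }
          }
          ```

          Because the wrapper only decorates the write path, the underlying channel code stays untouched; throttling policy lives entirely in user code.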

          lgoldstein Goldstein Lyor added a comment -

          Looks like a good idea - I believe though that the better solution would be to implement such a "throttling" at the channel level instead of the session - after all, we might want to control each channel separately. I.e., ChannelOutputStream/ChannelAsyncOutputStream - write/flush. Furthermore, the "throttle" rate should be configurable - where zero (default) means "no throttling".

          lgoldstein Goldstein Lyor added a comment -

          I have not had time to really think about the scenario you presented, and nowadays it will take me a while to find some free time to review it. Please feel free to go ahead and publish a pull-request once you think you have a proposal in mind - it will be easier for us to review it and better understand the problem and its possible fix.

          That being said, off the top of my head - I believe one can control the window size via properties and/or window-size SSH messages (don't remember offhand the exact message exchange). AFAIK, the window size mechanism provides a way to throttle the data rate, so maybe that can be a way to address this issue as well.
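          The window mechanism referred to above can be modeled with a toy class (names are illustrative, not the sshd API): per RFC 4254 section 5.2, the sender may transmit at most the advertised window and must wait for an SSH_MSG_CHANNEL_WINDOW_ADJUST from the peer before sending more, so a small window naturally bounds pending data.

          ```java
          // Toy model of the SSH channel remote window (RFC 4254, section 5.2).
          class RemoteWindow {
              private long available;

              RemoteWindow(long initialSize) {
                  this.available = initialSize;
              }

              // Called before sending: grants at most what the window allows.
              long consume(long wanted) {
                  long granted = Math.min(wanted, available);
                  available -= granted;
                  return granted;
              }

              // Models a window-adjust message arriving from the receiver.
              void expand(long bytes) {
                  available += bytes;
              }
          }
          ```

          With a ~2 GB initial window, consume() effectively never limits the sender, which matches the failure mode reported in this issue; capping the advertised window is what makes this mechanism act as a throttle.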

          jonnyzzz Eugene Petrenko added a comment -

          It is indeed a non-trivial problem that I have discovered. First, I want to make sure the proposed fix is good enough to be converted into a pull request.

          The problem I have is that on the SSH server:
          1) channel remote window is set to be 2GB
          2) client is slow to receive data (data is generated faster than it is consumed by the network/client)
          2a) Slow network connection
          2b) Rekey is running

          Because of 1) and 2) we have OOM - too many data chunks are queued.

          The proposed solution is to limit the send queue only for DATA and EXTENDED_DATA messages (is that correct?) by blocking an overly active sender (note that a deadlock is possible if we block a NIO/callback thread).

          An alternative solution could be to implement similar logic in org.apache.sshd.common.channel.ChannelOutputStream and org.apache.sshd.common.channel.ChannelAsyncOutputStream. However, there might be other usages of Channel/Session, so fixing those two classes alone may not be enough, and some trick would be needed to avoid code duplication.

          What would you say?

          lgoldstein Goldstein Lyor added a comment -

          If indeed you have a fix / improvement in mind, a pull-request is the best way to have it evaluated and eventually merged...

          jonnyzzz Eugene Petrenko added a comment -

          After reproducing the problem in tests I was able to come up with the following patch (in Kotlin), applied in a subclass of ServerSessionImpl, which fixes the test:

            // Applied inside a subclass of ServerSessionImpl; needed imports:
            //   java.util.concurrent.Semaphore
            //   org.apache.sshd.common.SshConstants
            //   org.apache.sshd.common.future.SshFutureListener
            //   org.apache.sshd.common.io.IoWriteFuture
            //   org.apache.sshd.common.util.buffer.Buffer

            private class PressureLock {
              // at most 100 in-flight packets per stream
              private val semaphore = Semaphore(100)

              fun acquire(): SshFutureListener<IoWriteFuture?> {
                semaphore.acquire()
                return listener
              }

              private val listener = object : SshFutureListener<IoWriteFuture?> {
                override fun operationComplete(future: IoWriteFuture?) {
                  semaphore.release()
                }
              }
            }

            private val CHANNEL_STDOUT_LOCK = PressureLock()
            private val CHANNEL_STDERR_LOCK = PressureLock()

            override fun writePacket(buffer: Buffer): IoWriteFuture {
              // The workaround for VCS-797
              // and https://issues.apache.org/jira/browse/SSHD-754:
              // block the writer thread once there are more than 100
              // messages in either the rekey wait queue or the NIO write queue
              val lock = when (buffer.array()[buffer.rpos()]) {
                SshConstants.SSH_MSG_CHANNEL_DATA -> CHANNEL_STDOUT_LOCK
                SshConstants.SSH_MSG_CHANNEL_EXTENDED_DATA -> CHANNEL_STDERR_LOCK
                else -> null
              }?.acquire()

              val future = super.writePacket(buffer)
              if (lock != null) {
                future.addListener(lock)
              }
              return future
            }
          
          jonnyzzz Eugene Petrenko added a comment -

          It looks like starting from April 2016, plink enables 'simple' mode for SSH. It was done by commit b22c0b6f3e6f5254270a89f86df3edfc4da829d2
          https://git.tartarus.org/?p=simon/putty.git;a=commit;h=b22c0b6f3e6f5254270a89f86df3edfc4da829d2.

          Starting from 0.68 it sends 2 GB (0x7fffffff) as its receive window size. That makes the SSHD library vulnerable to OOM.
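          To put that window size in perspective, a rough calculation (the packet payload size is an assumption for illustration) shows how many data packets — each pinned by a PendingWriteFuture — can be queued before the window is exhausted:

          ```java
          // Rough arithmetic: how many data packets fit into the 0x7fffffff-byte
          // receive window that plink >= 0.68 advertises?
          public class WindowMath {
              public static void main(String[] args) {
                  long window = 0x7fffffffL;       // ~2 GiB advertised receive window
                  long packetPayload = 32 * 1024;  // assumed 32 KiB per SSH_MSG_CHANNEL_DATA
                  long packets = window / packetPayload;
                  System.out.println("window admits " + packets + " packets"); // 65535
              }
          }
          ```

          Tens of thousands of queued buffers plus their futures easily exceed a modest heap, especially while a rekey is stalling the actual writes.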

          The simplified steps to reproduce (STR) are as follows. We need a command that returns a huge amount of data, say bigger than the heap size. Then it is enough to have a client that is slow to read data (e.g. a slow channel); the server will easily queue too many packets and run out of memory.

          Looks like org.apache.sshd.common.channel.ChannelOutputStream and similar classes should take into account not only the window size but also their own write queue.
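          A queue-aware output stream along those lines can be sketched as follows (names are illustrative, not the real ChannelOutputStream API; the 'send' callback stands in for the channel's asynchronous write): the stream blocks the producer once a fixed number of chunks are unacknowledged, independent of the remote window.

          ```java
          import java.io.IOException;
          import java.io.OutputStream;
          import java.util.Arrays;
          import java.util.concurrent.Semaphore;
          import java.util.function.BiConsumer;

          // Output stream that caps its own pending-write queue: at most
          // maxPendingChunks chunks may be unacknowledged at any time.
          class BoundedChannelOutput extends OutputStream {
              private final BiConsumer<byte[], Runnable> send; // (chunk, completion callback)
              private final Semaphore pending;

              BoundedChannelOutput(BiConsumer<byte[], Runnable> send, int maxPendingChunks) {
                  this.send = send;
                  this.pending = new Semaphore(maxPendingChunks);
              }

              @Override
              public void write(int b) throws IOException {
                  write(new byte[] { (byte) b }, 0, 1);
              }

              @Override
              public void write(byte[] b, int off, int len) {
                  pending.acquireUninterruptibly(); // blocks the producer when the queue is full
                  byte[] chunk = Arrays.copyOfRange(b, off, off + len);
                  send.accept(chunk, pending::release); // permit freed when the write completes
              }
          }
          ```

          With such a cap in place, even a 2 GB remote window cannot translate into an unbounded number of queued buffers on the server side.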

          jonnyzzz Eugene Petrenko added a comment -

          Tried again with 1.4.0, where SSHD-701 is closed. Same issue: the remote window size is enormous, so everything fits into the window, generating an endless pending-write queue.


            People

            • Assignee:
              lgoldstein Goldstein Lyor
              Reporter:
              jonnyzzz Eugene Petrenko
            • Votes:
              0
              Watchers:
              2
