Uploaded image for project: 'MINA'
  1. MINA
  2. DIRMINA-1021

MINA-CORE does not remove sessions if exceptions occur while closing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.8
    • Fix Version/s: 2.0.14
    • Component/s: None
    • Labels:
      None
    • Environment:
      mina-ssh 0.14.0
      mina-core 2.0.8
      Multiple OSes / Java configurations:
      * Mac OS X El Capitan on Java 8 (1.8.0_60)
      * CentOS 6.4 on Java 8 (1.8.0_60)
      * CentOS 6.5 on Java 8 (1.8.0_20-b26).
    • Flags:
      Patch

      Description

      MINA SSHD isn't removing sessions when using the MINA/NIO backend if an exception as received as the session is closing (such as a connection reset is received with data still in the write buffer). When this case happens, it seems that NioProcessor.getState returns the state as CLOSING (I'm assuming since the underlying channel is now closed), which means that the AbstractPollingIoProcessor.removeSessions() won't ever prune the session, since a CLOSING state is simply ignored. The result is a resource leak over time, since these sessions are never pruned (it's a slow leak, since entering this condition is racy – on my workstation, I can produce it through randomly interrupting connections anywhere from 1/6 to 1/10th of the time). (This may either be major or critical; reprioritize as necessary.)

      I specifically see this error with Gerrit 2.10.4 and Gerrit 2.11.5 (using mina-sshd 0.14.0 / mina-core 2.0.8), and it looks like the code path is unchanged in mina-sshd 1.0.0 / mina-core 2.0.9. I was unsure if this is specifically a bug in mina-core or, if it's something unique to mina-sshd. My local development system runs Mac OS X El Capitan on Java 8 (1.8.0_60), but I've also seen this on Linux (CentOS 6.4, again Java 1.8.0_60 and CentOS 6.5 on Java 1.8.0_20-b26).

      The fix may be as simple as attempting to remove the session if OPENED or CLOSING, but I'm unsure what side-effects this may have with other backends. I'll be happy to test it locally, but I'm fairly ignorant when it comes to MINA's code.

      The attached patch (to mina-core) seems to resolve the issue by following the reproduction case I have on the Gerrit issue tracker.

      1. mina-core-2.0.14-SNAPSHOT.jar
        616 kB
        Emmanuel Lecharny
      2. mina-core-2.0.14-SNAPSHOT.jar
        616 kB
        Emmanuel Lecharny
      3. attempt-removing-sessions-closing.patch
        1 kB
        Doug Kelly

        Issue Links

          Activity

          Hide
          elecharny Emmanuel Lecharny added a comment -

          This is not what happens.

          The removeSessions() polls the removingSessions queue, which means the session is removed from this queue. At this point, the session should be reclaimed by the GC, assuming nothing else is using it.

          The question is : what holds this session, causing the leak ?

          Show
          elecharny Emmanuel Lecharny added a comment - This is not what happens. The removeSessions() polls the removingSessions queue, which means the session is removed from this queue. At this point, the session should be reclaimed by the GC, assuming nothing else is using it. The question is : what holds this session, causing the leak ?
          Hide
          dougk Doug Kelly added a comment -

          I'm not sure I understand your comment entirely: this is exactly what happens, according to my analysis in a debugger.

          As you commented, removeSessions() polls the queue, which means the session is no longer in removingSessions. However, removeNow() is never called, which calls destroy() and the fireSessionDestroyed() handlers.

          Could this patch be masking the issue because the channel should be AutoCloseable and something is still holding a reference, keeping the channel open? Maybe calling removeNow() is masking the problem by forcibly closing the underlying channel and calling fireSessionDestroyed()? From what I can tell, the connection still showed up in the session list, meaning getManagedSessions() was still tracking it, which would make sense with the "someone else is holding a reference" idea. Sessions appear to only be removed by IoServiceListenerSupport.fireSessionDestroyed() which further supports the idea.

          Is there another way you would propose fixing this?

          Show
          dougk Doug Kelly added a comment - I'm not sure I understand your comment entirely: this is exactly what happens, according to my analysis in a debugger. As you commented, removeSessions() polls the queue, which means the session is no longer in removingSessions. However, removeNow() is never called, which calls destroy() and the fireSessionDestroyed() handlers. Could this patch be masking the issue because the channel should be AutoCloseable and something is still holding a reference, keeping the channel open? Maybe calling removeNow() is masking the problem by forcibly closing the underlying channel and calling fireSessionDestroyed() ? From what I can tell, the connection still showed up in the session list, meaning getManagedSessions() was still tracking it, which would make sense with the "someone else is holding a reference" idea. Sessions appear to only be removed by IoServiceListenerSupport.fireSessionDestroyed() which further supports the idea. Is there another way you would propose fixing this?
          Hide
          elecharny Emmanuel Lecharny added a comment -

          What I mean is that the removeSessions() function role is just to close opened sessions scheduled for closing. When you actually want to close a session, you push the session to this least, whatever its status is (OPENING, OPENED or CLOSING). This is not the place where we actually close the sessions, except when they are oepned. It means that if the session is in CLOSING state, then it was already closed somewhere else.

          I'm currently looking where this close could have happened (and that includes exceptions).

          Show
          elecharny Emmanuel Lecharny added a comment - What I mean is that the removeSessions() function role is just to close opened sessions scheduled for closing. When you actually want to close a session, you push the session to this least, whatever its status is (OPENING, OPENED or CLOSING). This is not the place where we actually close the sessions, except when they are oepned. It means that if the session is in CLOSING state, then it was already closed somewhere else. I'm currently looking where this close could have happened (and that includes exceptions).
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Let's see what's happening when a session is closed because of an exception. For instance, the AbstractPollingIoProcessor.writeBuffer() method call close() when an exception occurs :

          writeBuffer()
              private int writeBuffer(S session, WriteRequest req, boolean hasFragmentation, int maxLength, long currentTime)
                      throws Exception {
                  ...
                  if (buf.hasRemaining()) {
                      ...
                      try {
                          localWrittenBytes = write(session, buf, length);
                      } catch (IOException ioe) {
                          // We have had an issue while trying to send data to the
                          // peer : let's close the session.
                          buf.free();
                          session.close(true);
                          destroy(session);
          
                          return 0;
                      }
              ...
          

          (let's ignore the destroy(session) atm).

          The session.close( true ) execute this code :

          close()
              public final CloseFuture close(boolean rightNow) {
                  if (!isClosing()) {
                      if (rightNow) {
                          synchronized (lock) {
                              if (isClosing()) {
                                  return closeFuture;
                              }
          
                              closing = true;
                          }
          
                          getFilterChain().fireFilterClose();
          
                          return closeFuture;
                      }
          
                      return closeOnFlush();
                  } else {
                      return closeFuture;
                  }
              }
          

          The important part is getFilterChain().fireFilterClose(), which implementation is :

          fireFilterClose()
              public void fireFilterClose() {
                  callPreviousFilterClose(tail, session);
              }
          

          which calls :

          callPreviousFilterClose()
              private void callPreviousFilterClose(Entry entry, IoSession session) {
                  try {
                      IoFilter filter = entry.getFilter();
                      NextFilter nextFilter = entry.getNextFilter();
                      filter.filterClose(nextFilter, session);
                  } catch (Exception e) {
                      fireExceptionCaught(e);
                  } catch (Error e) {
                      fireExceptionCaught(e);
                      throw e;
                  }
              }
          

          which in turns call ultimately :

          HeadFilter.filterClose()
                  public void filterClose(NextFilter nextFilter, IoSession session) throws Exception {
                      ((AbstractIoSession) session).getProcessor().remove(session);
                  }
          

          then :

          AbstractPollingIoProcessorremove()
              public final void remove(S session) {
                  scheduleRemove(session);
                  startupProcessor();
              }
          
              private void scheduleRemove(S session) {
                  removingSessions.add(session);
              }
          

          At this point, closing a session is not enough. Either you wait on the returned CloseFuture instance, or you actually destroy the session (which is probably what is missing in your case).

          Show
          elecharny Emmanuel Lecharny added a comment - Let's see what's happening when a session is closed because of an exception. For instance, the AbstractPollingIoProcessor.writeBuffer() method call close() when an exception occurs : writeBuffer() private int writeBuffer(S session, WriteRequest req, boolean hasFragmentation, int maxLength, long currentTime) throws Exception { ... if (buf.hasRemaining()) { ... try { localWrittenBytes = write(session, buf, length); } catch (IOException ioe) { // We have had an issue while trying to send data to the // peer : let's close the session. buf.free(); session.close( true ); destroy(session); return 0; } ... (let's ignore the destroy(session) atm). The session.close( true ) execute this code : close() public final CloseFuture close( boolean rightNow) { if (!isClosing()) { if (rightNow) { synchronized (lock) { if (isClosing()) { return closeFuture; } closing = true ; } getFilterChain().fireFilterClose(); return closeFuture; } return closeOnFlush(); } else { return closeFuture; } } The important part is getFilterChain().fireFilterClose() , which implementation is : fireFilterClose() public void fireFilterClose() { callPreviousFilterClose(tail, session); } which calls : callPreviousFilterClose() private void callPreviousFilterClose(Entry entry, IoSession session) { try { IoFilter filter = entry.getFilter(); NextFilter nextFilter = entry.getNextFilter(); filter.filterClose(nextFilter, session); } catch (Exception e) { fireExceptionCaught(e); } catch (Error e) { fireExceptionCaught(e); throw e; } } which in turns call ultimately : HeadFilter.filterClose() public void filterClose(NextFilter nextFilter, IoSession session) throws Exception { ((AbstractIoSession) session).getProcessor().remove(session); } then : AbstractPollingIoProcessorremove() public final void remove(S session) { scheduleRemove(session); startupProcessor(); } private void scheduleRemove(S session) { removingSessions.add(session); } At this point, closing a session is not enough. Either you wait on the returned CloseFuture instance, or you actually destroy the session (which is probably what is missing in your case).
          Hide
          joris.w joris welkenhuysen added a comment - - edited

          Could it be that the result of the getState(NioSession session) method doesn't reliably indicate that the session is OPENED ?

          If that is the case, it could be the reason why the removeSessions() method doesn't do what should be done to remove the session

          Show
          joris.w joris welkenhuysen added a comment - - edited Could it be that the result of the getState(NioSession session) method doesn't reliably indicate that the session is OPENED ? If that is the case, it could be the reason why the removeSessions() method doesn't do what should be done to remove the session
          Hide
          elecharny Emmanuel Lecharny added a comment -

          The getState() method is pretty straightforward : it just check the session SelectionKey, and return OPENING when we have none, OPENED when it's valid and CLOSING when it's not valid. What could happend is that the method returns OPENED but the session is not anymore in this state when we check the status, because something happened in the meantime.

          That does not matter too much because whatever the returned status could be, it's too late in the removeSessions() to deal with the session removal, escpecially when we have got an exception. This is typically a case where we should absolutely clean the session while we are processing the exception, because otherwise, nothing will do it...

          Show
          elecharny Emmanuel Lecharny added a comment - The getState() method is pretty straightforward : it just check the session SelectionKey , and return OPENING when we have none, OPENED when it's valid and CLOSING when it's not valid. What could happend is that the method returns OPENED but the session is not anymore in this state when we check the status, because something happened in the meantime. That does not matter too much because whatever the returned status could be, it's too late in the removeSessions() to deal with the session removal, escpecially when we have got an exception. This is typically a case where we should absolutely clean the session while we are processing the exception, because otherwise, nothing will do it...
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Actually, when an exception is caught, it's propagated to the IoHandler, which should be in charge to close the session. I suspect this is not done in SSHD, but I haven't yet looked at the code.

          Beside reading the code, and mentally try to imagine what are the possible paths that could cause the problem, I don't have a lot of clue. What would be very helpful is a piece of code that demonstrate the issue,, so that I can debug it.

          And, yes, analyzing asynchronous code is really complex, those who pretend that async is the way to go are either genious or are totally ignorant of the complexity it brings... Most of them enter in the second category, I'm afraid. Keep in mind that MINA was created to hide this aspect to most of the user, but sadly, at some point, you might get snapped...

          Show
          elecharny Emmanuel Lecharny added a comment - Actually, when an exception is caught, it's propagated to the IoHandler , which should be in charge to close the session. I suspect this is not done in SSHD, but I haven't yet looked at the code. Beside reading the code, and mentally try to imagine what are the possible paths that could cause the problem, I don't have a lot of clue. What would be very helpful is a piece of code that demonstrate the issue,, so that I can debug it. And, yes, analyzing asynchronous code is really complex, those who pretend that async is the way to go are either genious or are totally ignorant of the complexity it brings... Most of them enter in the second category, I'm afraid. Keep in mind that MINA was created to hide this aspect to most of the user, but sadly, at some point, you might get snapped...
          Hide
          elecharny Emmanuel Lecharny added a comment -

          I wonder if the problem is not in sshd code. Typically, when handling an exception when teh sessionn is closing, nothing is done in AbstractSession.exceptionCaught:

              /**
               * Handle any exceptions that occured on this session.
               * The session will be closed and a disconnect packet will be
               * sent before if the given exception is an
               * {@link org.apache.sshd.common.SshException}.
               * 
               * @param t the exception to process
               */
              public void exceptionCaught(Throwable t) {
                  // Ignore exceptions that happen while closing
                  synchronized (lock) {
                      if (isClosing()) {
                          return;
                      }
                  }
          ...
          
          Show
          elecharny Emmanuel Lecharny added a comment - I wonder if the problem is not in sshd code. Typically, when handling an exception when teh sessionn is closing, nothing is done in AbstractSession.exceptionCaught : /** * Handle any exceptions that occured on this session. * The session will be closed and a disconnect packet will be * sent before if the given exception is an * {@link org.apache.sshd.common.SshException}. * * @param t the exception to process */ public void exceptionCaught(Throwable t) { // Ignore exceptions that happen while closing synchronized (lock) { if (isClosing()) { return; } } ...
          Hide
          dougk Doug Kelly added a comment -

          Hi Emmanuel,

          I've been continuing to look into this (but I've been sidetracked over the past week), but I was having issues reproducing and trapping the bug in a debugger again. I was able to pretty simply reproduce the issue with Gerrit (which is far more than a simple test application), but if you want help getting that test environment set up so you can debug locally, please let me know!

          I'll try to dedicate some more time to this issue, since it is important to me (and I appreciate the time you've given it), so please let me know if there's anything I can do.

          Show
          dougk Doug Kelly added a comment - Hi Emmanuel, I've been continuing to look into this (but I've been sidetracked over the past week), but I was having issues reproducing and trapping the bug in a debugger again. I was able to pretty simply reproduce the issue with Gerrit (which is far more than a simple test application), but if you want help getting that test environment set up so you can debug locally, please let me know! I'll try to dedicate some more time to this issue, since it is important to me (and I appreciate the time you've given it), so please let me know if there's anything I can do.
          Hide
          lgoldstein Goldstein Lyor added a comment - - edited

          Not my area of expertise, but from what I could see in the code, the reason that nothing is done if isClosing is that it means that the session is being closed anyway via the code in AbstractCloseable#close. In other words, this is a kind of protection against running the close code concurrently.

          Show
          lgoldstein Goldstein Lyor added a comment - - edited Not my area of expertise, but from what I could see in the code, the reason that nothing is done if isClosing is that it means that the session is being closed anyway via the code in AbstractCloseable#close . In other words, this is a kind of protection against running the close code concurrently.
          Hide
          lgoldstein Goldstein Lyor added a comment -

          I would also recommended enabling logging of SSHD as there seem to be rather detailed messages during the close process

          Show
          lgoldstein Goldstein Lyor added a comment - I would also recommended enabling logging of SSHD as there seem to be rather detailed messages during the close process
          Hide
          elecharny Emmanuel Lecharny added a comment -

          I would have suggested the same thing, but you beat me

          Show
          elecharny Emmanuel Lecharny added a comment - I would have suggested the same thing, but you beat me
          Hide
          elecharny Emmanuel Lecharny added a comment - - edited

          yes, but...

          what happens is that isClosing simply means the session is scheduled for closing, not actually closed. The rationnal being that some data might be still waiting to be sent, and because the server has decided to close the session does not necessarily implies those data should be discarded.

          In this very case, we have got an exception, mostly because the client is not there anymore, thus there is no reason to wait for the pending data to be kept in memory and for the session to remain in this state forever. This is why it's good to close the session immediately when an exception is caught.

          Show
          elecharny Emmanuel Lecharny added a comment - - edited yes, but... what happens is that isClosing simply means the session is scheduled for closing, not actually closed. The rationnal being that some data might be still waiting to be sent, and because the server has decided to close the session does not necessarily implies those data should be discarded. In this very case, we have got an exception, mostly because the client is not there anymore, thus there is no reason to wait for the pending data to be kept in memory and for the session to remain in this state forever. This is why it's good to close the session immediately when an exception is caught.
          Hide
          lgoldstein Goldstein Lyor added a comment -

          I think the only time that a session is scheduled for closing rather than being closed immediately is when a graceful close is in progress while an exception is caught. But even then, I assume the operationComplete is eventually called, and it will then do an immediate close of the MINA IoSession, which then should clean up after itself...

          Show
          lgoldstein Goldstein Lyor added a comment - I think the only time that a session is scheduled for closing rather than being closed immediately is when a graceful close is in progress while an exception is caught. But even then, I assume the operationComplete is eventually called, and it will then do an immediate close of the MINA IoSession , which then should clean up after itself...
          Hide
          dougk Doug Kelly added a comment -

          Ah, glad you mentioned that! Here's what happens in when enabling debug logging in Gerrit (working successfully – as in, this did not trigger the bug):

          [2015-11-19 14:43:22,812] DEBUG com.google.gerrit.sshd.GerritServerSession : Received SSH_MSG_DISCONNECT (reason=11, msg=disconnected by user)
          [2015-11-19 14:43:22,813] DEBUG com.google.gerrit.sshd.GerritServerSession : Closing GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] immediately
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.session.ServerConnectionService : Closing ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] immediately
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : Closing DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] immediately
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] closed
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : Closing AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] immediately
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] closed
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : Closing org.apache.sshd.server.x11.X11ForwardSupport@16efad0a immediately
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : org.apache.sshd.server.x11.X11ForwardSupport@16efad0a closed
          [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.channel.ChannelSession : Closing ChannelSession[id=0, recipient=0] immediately
          [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.server.channel.ChannelSession : ChannelSession[id=0, recipient=0] closed
          [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.server.session.ServerConnectionService : ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] closed
          [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.common.io.mina.MinaSession : Closing MinaSession[local=/0:0:0:0:0:0:0:1:29418, remote=/0:0:0:0:0:0:0:1:56319] immediately
          [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.common.io.mina.MinaSession : MinaSession[local=/0:0:0:0:0:0:0:1:29418, remote=/0:0:0:0:0:0:0:1:56319] closed
          [2015-11-19 14:43:22,861] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] closed
          [2015-11-19 14:43:22,862] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] is already closed
          [2015-11-19 14:43:22,864] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user dougk account 1000000) during git-upload-pack '/gerrit'
          org.apache.sshd.common.channel.WindowClosedException: Already closed
          

          Here's a case where I successfully reproduced the issue:

          [2015-11-19 14:47:20,202] WARN  com.google.gerrit.sshd.GerritServerSession : Exception caught
          java.nio.channels.CancelledKeyException
                  at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
                  at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
                  at org.apache.mina.transport.socket.nio.NioProcessor.setInterestedInWrite(NioProcessor.java:287)
                  at org.apache.mina.transport.socket.nio.NioProcessor.setInterestedInWrite(NioProcessor.java:45)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor.flushNow(AbstractPollingIoProcessor.java:880)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor.flush(AbstractPollingIoProcessor.java:778)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$700(AbstractPollingIoProcessor.java:67)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1126)
                  at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          [2015-11-19 14:47:20,202] DEBUG com.google.gerrit.sshd.GerritServerSession : Closing GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415] immediately
          [2015-11-19 14:47:20,202] DEBUG org.apache.sshd.server.session.ServerConnectionService : Closing ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] immediately
          [2015-11-19 14:47:20,202] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : Closing DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] immediately
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : Closing AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] immediately
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : Closing org.apache.sshd.server.x11.X11ForwardSupport@5774e544 immediately
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Send SSH_MSG_CHANNEL_DATA on channel 0
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : org.apache.sshd.server.x11.X11ForwardSupport@5774e544 closed
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Closing ChannelSession[id=0, recipient=0] immediately
          [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.channel.ChannelSession : ChannelSession[id=0, recipient=0] closed
          [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.session.ServerConnectionService : ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed
          [2015-11-19 14:47:20,210] DEBUG org.apache.sshd.common.io.mina.MinaSession : Closing MinaSession[local=0.0.0.0/0.0.0.0:29418, remote=/0:0:0:0:0:0:0:1:56415] immediately
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Send SSH_MSG_CHANNEL_DATA on channel 0
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : org.apache.sshd.server.x11.X11ForwardSupport@5774e544 closed
          [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Closing ChannelSession[id=0, recipient=0] immediately
          [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.channel.ChannelSession : ChannelSession[id=0, recipient=0] closed
          [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.session.ServerConnectionService : ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed
          [2015-11-19 14:47:20,210] DEBUG org.apache.sshd.common.io.mina.MinaSession : Closing MinaSession[local=0.0.0.0/0.0.0.0:29418, remote=/0:0:0:0:0:0:0:1:56415] immediately
          [2015-11-19 14:47:20,210] DEBUG org.apache.sshd.common.io.mina.MinaSession : MinaSession[local=0.0.0.0/0.0.0.0:29418, remote=/0:0:0:0:0:0:0:1:56415] closed
          [2015-11-19 14:47:20,210] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415] closed
          

          Also, a stack trace from when I ended up in exceptionCaught() (i.e. this triggered the bug):

          Thread [NioProcessor-2] (Suspended (breakpoint at line 47 in AbstractSessionIoHandler))
                  SshDaemon$1(AbstractSessionIoHandler).exceptionCaught(IoSession, Throwable) line: 47
                  MinaAcceptor(MinaService).exceptionCaught(IoSession, Throwable) line: 101
                  DefaultIoFilterChain$TailFilter.exceptionCaught(IoFilter$NextFilter, IoSession, Throwable) line: 824
                  DefaultIoFilterChain.callNextExceptionCaught(IoFilterChain$Entry, IoSession, Throwable) line: 590
                  DefaultIoFilterChain.access$1100(DefaultIoFilterChain, IoFilterChain$Entry, IoSession, Throwable) line: 48
                  DefaultIoFilterChain$EntryImpl$1.exceptionCaught(IoSession, Throwable) line: 933
                  DefaultIoFilterChain$HeadFilter(IoFilterAdapter).exceptionCaught(IoFilter$NextFilter, IoSession, Throwable) line: 102
                  DefaultIoFilterChain.callNextExceptionCaught(IoFilterChain$Entry, IoSession, Throwable) line: 590
                  DefaultIoFilterChain.fireExceptionCaught(Throwable) line: 580
                  NioProcessor(AbstractPollingIoProcessor<S>).read(S) line: 741
                  NioProcessor(AbstractPollingIoProcessor<S>).process(S) line: 668
                  NioProcessor(AbstractPollingIoProcessor<S>).process() line: 657
                  AbstractPollingIoProcessor<S>.access$600(AbstractPollingIoProcessor) line: 67
                  AbstractPollingIoProcessor$Processor.run() line: 1121
                  NamePreservingRunnable.run() line: 64
                  ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1142
                  ThreadPoolExecutor$Worker.run() line: 617
                  Thread.run() line: 745
          
          Show
          dougk Doug Kelly added a comment - Ah, glad you mentioned that! Here's what happens in when enabling debug logging in Gerrit (working successfully – as in, this did not trigger the bug): [2015-11-19 14:43:22,812] DEBUG com.google.gerrit.sshd.GerritServerSession : Received SSH_MSG_DISCONNECT (reason=11, msg=disconnected by user) [2015-11-19 14:43:22,813] DEBUG com.google.gerrit.sshd.GerritServerSession : Closing GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] immediately [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.session.ServerConnectionService : Closing ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] immediately [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : Closing DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] immediately [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] closed [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : Closing AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] immediately [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] closed [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : Closing org.apache.sshd.server.x11.X11ForwardSupport@16efad0a immediately [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : org.apache.sshd.server.x11.X11ForwardSupport@16efad0a closed [2015-11-19 14:43:22,815] DEBUG org.apache.sshd.server.channel.ChannelSession : Closing ChannelSession[id=0, recipient=0] immediately [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.server.channel.ChannelSession : ChannelSession[id=0, recipient=0] closed [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.server.session.ServerConnectionService : ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319]] closed [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.common.io.mina.MinaSession : Closing MinaSession[local=/0:0:0:0:0:0:0:1:29418, remote=/0:0:0:0:0:0:0:1:56319] immediately [2015-11-19 14:43:22,860] DEBUG org.apache.sshd.common.io.mina.MinaSession : MinaSession[local=/0:0:0:0:0:0:0:1:29418, remote=/0:0:0:0:0:0:0:1:56319] closed [2015-11-19 14:43:22,861] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] closed [2015-11-19 14:43:22,862] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] is already closed [2015-11-19 14:43:22,864] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user dougk account 1000000) during git-upload-pack '/gerrit' org.apache.sshd.common.channel.WindowClosedException: Already closed Here's a case where I successfully reproduced the issue: [2015-11-19 14:47:20,202] WARN com.google.gerrit.sshd.GerritServerSession : Exception caught java.nio.channels.CancelledKeyException at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73) at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77) at org.apache.mina.transport.socket.nio.NioProcessor.setInterestedInWrite(NioProcessor.java:287) at org.apache.mina.transport.socket.nio.NioProcessor.setInterestedInWrite(NioProcessor.java:45) at org.apache.mina.core.polling.AbstractPollingIoProcessor.flushNow(AbstractPollingIoProcessor.java:880) at org.apache.mina.core.polling.AbstractPollingIoProcessor.flush(AbstractPollingIoProcessor.java:778) at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$700(AbstractPollingIoProcessor.java:67) at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1126) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang. Thread .run( Thread .java:745) [2015-11-19 14:47:20,202] DEBUG com.google.gerrit.sshd.GerritServerSession : Closing GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415] immediately [2015-11-19 14:47:20,202] DEBUG org.apache.sshd.server.session.ServerConnectionService : Closing ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] immediately [2015-11-19 14:47:20,202] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : Closing DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] immediately [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.common.forward.DefaultTcpipForwarder : DefaultTcpipForwarder[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : Closing AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] immediately [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.agent.common.AgentForwardSupport : AgentForwardSupport[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : Closing org.apache.sshd.server.x11.X11ForwardSupport@5774e544 immediately [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Send SSH_MSG_CHANNEL_DATA on channel 0 [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : org.apache.sshd.server.x11.X11ForwardSupport@5774e544 closed [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Closing ChannelSession[id=0, recipient=0] immediately [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.channel.ChannelSession : ChannelSession[id=0, recipient=0] closed [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.session.ServerConnectionService : ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed [2015-11-19 14:47:20,210] DEBUG org.apache.sshd.common.io.mina.MinaSession : Closing MinaSession[local=0.0.0.0/0.0.0.0:29418, remote=/0:0:0:0:0:0:0:1:56415] immediately [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Send SSH_MSG_CHANNEL_DATA on channel 0 [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.x11.X11ForwardSupport : org.apache.sshd.server.x11.X11ForwardSupport@5774e544 closed [2015-11-19 14:47:20,203] DEBUG org.apache.sshd.server.channel.ChannelSession : Closing ChannelSession[id=0, recipient=0] immediately [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.channel.ChannelSession : ChannelSession[id=0, recipient=0] closed [2015-11-19 14:47:20,209] DEBUG org.apache.sshd.server.session.ServerConnectionService : ServerConnectionService[GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415]] closed [2015-11-19 14:47:20,210] DEBUG org.apache.sshd.common.io.mina.MinaSession : Closing MinaSession[local=0.0.0.0/0.0.0.0:29418, remote=/0:0:0:0:0:0:0:1:56415] immediately [2015-11-19 14:47:20,210] DEBUG org.apache.sshd.common.io.mina.MinaSession : MinaSession[local=0.0.0.0/0.0.0.0:29418, remote=/0:0:0:0:0:0:0:1:56415] closed [2015-11-19 14:47:20,210] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56415] closed Also, a stack trace from when I ended up in exceptionCaught() (i.e. this triggered the bug): Thread [NioProcessor-2] (Suspended (breakpoint at line 47 in AbstractSessionIoHandler)) SshDaemon$1(AbstractSessionIoHandler).exceptionCaught(IoSession, Throwable) line: 47 MinaAcceptor(MinaService).exceptionCaught(IoSession, Throwable) line: 101 DefaultIoFilterChain$TailFilter.exceptionCaught(IoFilter$NextFilter, IoSession, Throwable) line: 824 DefaultIoFilterChain.callNextExceptionCaught(IoFilterChain$Entry, IoSession, Throwable) line: 590 DefaultIoFilterChain.access$1100(DefaultIoFilterChain, IoFilterChain$Entry, IoSession, Throwable) line: 48 DefaultIoFilterChain$EntryImpl$1.exceptionCaught(IoSession, Throwable) line: 933 DefaultIoFilterChain$HeadFilter(IoFilterAdapter).exceptionCaught(IoFilter$NextFilter, IoSession, Throwable) line: 102 DefaultIoFilterChain.callNextExceptionCaught(IoFilterChain$Entry, IoSession, Throwable) line: 590 DefaultIoFilterChain.fireExceptionCaught(Throwable) line: 580 NioProcessor(AbstractPollingIoProcessor<S>).read(S) line: 741 NioProcessor(AbstractPollingIoProcessor<S>).process(S) line: 668 NioProcessor(AbstractPollingIoProcessor<S>).process() line: 657 AbstractPollingIoProcessor<S>.access$600(AbstractPollingIoProcessor) line: 67 AbstractPollingIoProcessor$Processor.run() line: 1121 NamePreservingRunnable.run() line: 64 ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1142 ThreadPoolExecutor$Worker.run() line: 617 Thread .run() line: 745
          Hide
          lgoldstein Goldstein Lyor added a comment -

          I just noticed that you are using SSHD 0.14.0 which uses mina-core 2.0.8 - have you tried this with SSHD 1.0 (which uses mina-core 2.0.9). There were quite a few changes from SSHD 0.14 to 1.0 which may have fixed this issue.

          Show
          lgoldstein Goldstein Lyor added a comment - I just noticed that you are using SSHD 0.14.0 which uses mina-core 2.0.8 - have you tried this with SSHD 1.0 (which uses mina-core 2.0.9). There were quite a few changes from SSHD 0.14 to 1.0 which may have fixed this issue.
          Hide
          dougk Doug Kelly added a comment -

          I thought I might have tried that already, but I'll be happy to give it another try and let you know how it goes!

          Show
          dougk Doug Kelly added a comment - I thought I might have tried that already, but I'll be happy to give it another try and let you know how it goes!
          Hide
          lgoldstein Goldstein Lyor added a comment -

          The only reason I ask is that you said

          it looks like the code path is unchanged in mina-sshd 1.0.0 / mina-core 2.0.9

          so it is not clear whether you have actually tried it or just looked at the code.

          Show
          lgoldstein Goldstein Lyor added a comment - The only reason I ask is that you said it looks like the code path is unchanged in mina-sshd 1.0.0 / mina-core 2.0.9 so it is not clear whether you have actually tried it or just looked at the code.
          Hide
          lgoldstein Goldstein Lyor added a comment -

          I added some more code that takes care of caught exception in case of graceful session closing, as well as more detailed log messages. It may be worth trying to reproduce the problem with version 1.1 (SNAPSHOT...)

          Show
          lgoldstein Goldstein Lyor added a comment - I added some more code that takes care of caught exception in case of graceful session closing, as well as more detailed log messages. It may be worth trying to reproduce the problem with version 1.1 (SNAPSHOT...)
          Hide
          elecharny Emmanuel Lecharny added a comment -

          The stack trace is pretty much saying that the session can't be read, because it has been closed by the remote peer. In this case the CancelledKeyException results in a ExceptionCaught event being sent, and the session must deal with that (ie,it should 'commit suicide'

          Show
          elecharny Emmanuel Lecharny added a comment - The stack trace is pretty much saying that the session can't be read, because it has been closed by the remote peer. In this case the CancelledKeyException results in a ExceptionCaught event being sent, and the session must deal with that (ie,it should 'commit suicide'
          Hide
          lgoldstein Goldstein Lyor added a comment -

          As far as I can tell it does just that - it closes itself and (eventually) calls IoSession#close which (after quite a few more calls) eventually calls getFilterChain().fireFilterClose()

          Show
          lgoldstein Goldstein Lyor added a comment - As far as I can tell it does just that - it closes itself and (eventually) calls IoSession#close which (after quite a few more calls) eventually calls getFilterChain().fireFilterClose()
          Hide
          dougk Doug Kelly added a comment -

          Yeah, that's kinda what I gathered... I think the problem was, when the ExceptionCaught handlers fire, the close never actually happens: we get into the state where the connection is CLOSING, thus not removed from the session list (as returned by getManagedSessions). But, whether that's a symptom or a cause is not something I'd know. The interesting thing is in the debug logs, when things are working, the one main difference I've noticed is we actually try to close the session twice:

          [2015-11-19 14:43:22,862] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] is already closed

          We have a patch to try reproducing the issue with SSHD 1.0.0, and if that doesn't help, I'll give the 1.1 SNAPSHOT a try.

          Show
          dougk Doug Kelly added a comment - Yeah, that's kinda what I gathered... I think the problem was, when the ExceptionCaught handlers fire, the close never actually happens: we get into the state where the connection is CLOSING, thus not removed from the session list (as returned by getManagedSessions). But, whether that's a symptom or a cause is not something I'd know. The interesting thing is in the debug logs, when things are working, the one main difference I've noticed is we actually try to close the session twice: [2015-11-19 14:43:22,862] DEBUG com.google.gerrit.sshd.GerritServerSession : GerritServerSession[dougk@/0:0:0:0:0:0:0:1:56319] is already closed We have a patch to try reproducing the issue with SSHD 1.0.0, and if that doesn't help, I'll give the 1.1 SNAPSHOT a try.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          FTR, we are working on this issue on MINA side. It's related to DIRMINA-1006

          Show
          elecharny Emmanuel Lecharny added a comment - FTR, we are working on this issue on MINA side. It's related to DIRMINA-1006
          Hide
          dougk Doug Kelly added a comment -

          Thanks for your help! I should also add that Gerrit has recently updated to SSHD-1.0.0 and MINA-2.10 (some other very dedicated developers took care of resolving all the API changes). That said, if the fix is easily backported to older SSHD versions, it probably wouldn't hurt, since Gerrit won't have this on any of the current stable branches (and it would be a pretty major change to backport the SSHD upgrade to the older branches).

          Show
          dougk Doug Kelly added a comment - Thanks for your help! I should also add that Gerrit has recently updated to SSHD-1 .0.0 and MINA-2.10 (some other very dedicated developers took care of resolving all the API changes). That said, if the fix is easily backported to older SSHD versions, it probably wouldn't hurt, since Gerrit won't have this on any of the current stable branches (and it would be a pretty major change to backport the SSHD upgrade to the older branches).
          Hide
          elecharny Emmanuel Lecharny added a comment -

          I think this should be fixed with MINA 2.0.12. Can you give it a try ?

          Show
          elecharny Emmanuel Lecharny added a comment - I think this should be fixed with MINA 2.0.12. Can you give it a try ?
          Hide
          dougk Doug Kelly added a comment -

          I built a new version of Gerrit using MINA 2.0.13, however because of the issues that currently exist in SSHD (and are fixed by SSHD-639), I can't really test that this resolves the issue. Hopefully, the backport of the fix will be available on the 1.1.x branch soon, which would allow me to determine if this fix actually helps.

          Thanks for your assistance!

          Show
          dougk Doug Kelly added a comment - I built a new version of Gerrit using MINA 2.0.13, however because of the issues that currently exist in SSHD (and are fixed by SSHD-639 ), I can't really test that this resolves the issue. Hopefully, the backport of the fix will be available on the 1.1.x branch soon, which would allow me to determine if this fix actually helps. Thanks for your assistance!
          Hide
          lgoldstein Goldstein Lyor added a comment - - edited

          Can you try building and testing a version with the latest checked-in SSHD snapshot code (which includes the fix to SSHD-639) just so we know that it works ?

          Show
          lgoldstein Goldstein Lyor added a comment - - edited Can you try building and testing a version with the latest checked-in SSHD snapshot code (which includes the fix to SSHD-639 ) just so we know that it works ?
          Hide
          dougk Doug Kelly added a comment -

          I'd be glad to try; hopefully there's not a large number of API differences that may make this difficult.

          Show
          dougk Doug Kelly added a comment - I'd be glad to try; hopefully there's not a large number of API differences that may make this difficult.
          Hide
          dougk Doug Kelly added a comment -

          So far, it's appearing to work okay. I've not established the conditions where connections are leaked yet. However, I am noticing an increased amount of logging, with this message repeated dozens of times per connection:

          [2016-02-18 08:51:18,399] [NioProcessor-2] WARN  org.apache.mina.core.service.IoProcessor : Create a new selector. Selected is 0, delta = 0
          

          Additionally, after a while, I'm unable to connect to the server, and I see this exception in the logs:

          [2016-02-18 08:53:27,245] [NioSocketAcceptor-1] WARN  org.apache.mina.util.DefaultExceptionMonitor : Unexpected exception.
          java.lang.IllegalThreadStateException
                  at java.lang.ThreadGroup.addUnstarted(ThreadGroup.java:867)
                  at java.lang.Thread.init(Thread.java:402)
                  at java.lang.Thread.init(Thread.java:349)
                  at java.lang.Thread.<init>(Thread.java:675)
                  at org.apache.sshd.common.util.threads.ThreadUtils$SshdThreadFactory.newThread(ThreadUtils.java:177)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612)
                  at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925)
                  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor.startupProcessor(AbstractPollingIoProcessor.java:441)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor.add(AbstractPollingIoProcessor.java:370)
                  at org.apache.mina.core.polling.AbstractPollingIoProcessor.add(AbstractPollingIoProcessor.java:68)
                  at org.apache.mina.core.service.SimpleIoProcessorPool.add(SimpleIoProcessorPool.java:261)
                  at org.apache.mina.core.service.SimpleIoProcessorPool.add(SimpleIoProcessorPool.java:80)
                  at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.processHandles(AbstractPollingIoAcceptor.java:552)
                  at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:483)
                  at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          

          This seems to happen if I let the server sit idle for several minutes. If I continuously connect, everything seems to continue to work fine. I've reproduced this with mina-core 2.0.13 and 2.0.12, as well as 2.0.10, which was in use with sshd-1.0.0, but had the problem I described earlier.

          I can try running some more tests (the most effective test seemed to be using iptables to drop packets instead of trying to break the client connection at a specific point) and let you know how it goes...

          Show
          dougk Doug Kelly added a comment - So far, it's appearing to work okay. I've not established the conditions where connections are leaked yet. However, I am noticing an increased amount of logging, with this message repeated dozens of times per connection: [2016-02-18 08:51:18,399] [NioProcessor-2] WARN org.apache.mina.core.service.IoProcessor : Create a new selector. Selected is 0, delta = 0 Additionally, after a while, I'm unable to connect to the server, and I see this exception in the logs: [2016-02-18 08:53:27,245] [NioSocketAcceptor-1] WARN org.apache.mina.util.DefaultExceptionMonitor : Unexpected exception. java.lang.IllegalThreadStateException at java.lang. ThreadGroup .addUnstarted( ThreadGroup .java:867) at java.lang. Thread .init( Thread .java:402) at java.lang. Thread .init( Thread .java:349) at java.lang. Thread .<init>( Thread .java:675) at org.apache.sshd.common.util.threads.ThreadUtils$SshdThreadFactory.newThread(ThreadUtils.java:177) at java.util.concurrent.ThreadPoolExecutor$Worker.<init>(ThreadPoolExecutor.java:612) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:925) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1368) at org.apache.mina.core.polling.AbstractPollingIoProcessor.startupProcessor(AbstractPollingIoProcessor.java:441) at org.apache.mina.core.polling.AbstractPollingIoProcessor.add(AbstractPollingIoProcessor.java:370) at org.apache.mina.core.polling.AbstractPollingIoProcessor.add(AbstractPollingIoProcessor.java:68) at org.apache.mina.core.service.SimpleIoProcessorPool.add(SimpleIoProcessorPool.java:261) at org.apache.mina.core.service.SimpleIoProcessorPool.add(SimpleIoProcessorPool.java:80) at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.processHandles(AbstractPollingIoAcceptor.java:552) at org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:483) at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang. Thread .run( Thread .java:745) This seems to happen if I let the server sit idle for several minutes. If I continuously connect, everything seems to continue to work fine. I've reproduced this with mina-core 2.0.13 and 2.0.12, as well as 2.0.10, which was in use with sshd-1.0.0, but had the problem I described earlier. I can try running some more tests (the most effective test seemed to be using iptables to drop packets instead of trying to break the client connection at a specific point) and let you know how it goes...
          Hide
          lgoldstein Goldstein Lyor added a comment -

          This may have something to do with SSHD-646 - I have reverted it (for now) - please get the latest version and try again - see if this IllegalThreadStateException problem still occurs

          Show
          lgoldstein Goldstein Lyor added a comment - This may have something to do with SSHD-646 - I have reverted it (for now) - please get the latest version and try again - see if this IllegalThreadStateException problem still occurs
          Hide
          dougk Doug Kelly added a comment -

          I've built the latest version of mina-sshd against mina-core 2.0.13, and I have managed to reproduce this issue at least once:

          $ ssh localhost -p 29418 gerrit show-connections
          Session     Start     Idle   User            Remote Host
          --------------------------------------------------------------
          b6219fea 11:08:25 00:00:09   dougk           ?
          f6d95718 11:08:34 00:00:00   dougk           localhost
          --
          SSHD Backend: mina
          

          The "?" is indicative of a session that's been closed, has no more remote host information, but hasn't been removed from the session list.

          In addition to seeing "already closed" exceptions in my log:

          [2016-02-18 11:08:31,221] [SSH git-upload-pack '/gerrit' (dougk)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user dougk acc
          ount 1000000) during git-upload-pack '/gerrit'
          org.apache.sshd.common.channel.WindowClosedException: Already closed: Window[server/remote](ChannelSession[id=0, recipient=0]-GerritServerSession[
          dougk@/127.0.0.1:49347])
                  at org.apache.sshd.common.channel.Window.waitForCondition(Window.java:311)
                  at org.apache.sshd.common.channel.Window.waitForSpace(Window.java:264)
                  at org.apache.sshd.common.channel.ChannelOutputStream.flush(ChannelOutputStream.java:165)
                  at org.apache.sshd.common.channel.ChannelOutputStream.write(ChannelOutputStream.java:119)
                  at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:171)
                  at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:151)
                  at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:126)
                  at org.eclipse.jgit.internal.storage.file.ByteArrayWindow.write(ByteArrayWindow.java:89)
                  at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:248)
                  at org.eclipse.jgit.internal.storage.file.PackFile.copyPackAsIs(PackFile.java:355)
                  at org.eclipse.jgit.internal.storage.file.LocalCachedPack.copyAsIs(LocalCachedPack.java:85)
                  at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:236)
                  at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1010)
                  at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1538)
                  at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1391)
                  at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:772)
                  at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:666)
                  at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:85)
                  at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
                  at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
                  at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
                  at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:437)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                  at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:377)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          

          I also had one instance of a "timeout exceeded" error:

          [2016-02-18 11:09:55,729] [SSH git-upload-pack '/gerrit' (dougk)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user dougk account 1000000) during git-upload-pack '/gerrit'
          java.net.SocketTimeoutException: waitForCondition(Window[server/remote](ChannelSession[id=0, recipient=0]-GerritServerSession[dougk@null])) timeout exceeded: 30000
                  at org.apache.sshd.common.channel.Window.waitForCondition(Window.java:314)
                  at org.apache.sshd.common.channel.Window.waitForSpace(Window.java:264)
                  at org.apache.sshd.common.channel.ChannelOutputStream.flush(ChannelOutputStream.java:165)
                  at org.apache.sshd.common.channel.ChannelOutputStream.write(ChannelOutputStream.java:119)
                  at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:171)
                  at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:151)
                  at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:126)
                  at org.eclipse.jgit.internal.storage.file.ByteArrayWindow.write(ByteArrayWindow.java:89)
                  at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:248)
                  at org.eclipse.jgit.internal.storage.file.PackFile.copyPackAsIs(PackFile.java:355)
                  at org.eclipse.jgit.internal.storage.file.LocalCachedPack.copyAsIs(LocalCachedPack.java:85)
                  at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:236)
                  at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1010)
                  at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1538)
                  at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1391)
                  at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:772)
                  at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:666)
                  at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:85)
                  at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101)
                  at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32)
                  at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70)
                  at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:437)
                  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
                  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
                  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
                  at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:377)
                  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                  at java.lang.Thread.run(Thread.java:745)
          

          The hostname ("dougk@null") looks oddly suspicious there...

          But, the revert you made seems to have at least fixed the IllegalThreadStateException.

          Show
          dougk Doug Kelly added a comment - I've built the latest version of mina-sshd against mina-core 2.0.13, and I have managed to reproduce this issue at least once: $ ssh localhost -p 29418 gerrit show-connections Session Start Idle User Remote Host -------------------------------------------------------------- b6219fea 11:08:25 00:00:09 dougk ? f6d95718 11:08:34 00:00:00 dougk localhost -- SSHD Backend: mina The "?" is indicative of a session that's been closed, has no more remote host information, but hasn't been removed from the session list. In addition to seeing "already closed" exceptions in my log: [2016-02-18 11:08:31,221] [SSH git-upload-pack '/gerrit' (dougk)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user dougk acc ount 1000000) during git-upload-pack '/gerrit' org.apache.sshd.common.channel.WindowClosedException: Already closed: Window[server/remote](ChannelSession[id=0, recipient=0]-GerritServerSession[ dougk@/127.0.0.1:49347]) at org.apache.sshd.common.channel.Window.waitForCondition(Window.java:311) at org.apache.sshd.common.channel.Window.waitForSpace(Window.java:264) at org.apache.sshd.common.channel.ChannelOutputStream.flush(ChannelOutputStream.java:165) at org.apache.sshd.common.channel.ChannelOutputStream.write(ChannelOutputStream.java:119) at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:171) at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:151) at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:126) at org.eclipse.jgit.internal.storage.file.ByteArrayWindow.write(ByteArrayWindow.java:89) at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:248) at org.eclipse.jgit.internal.storage.file.PackFile.copyPackAsIs(PackFile.java:355) at org.eclipse.jgit.internal.storage.file.LocalCachedPack.copyAsIs(LocalCachedPack.java:85) at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:236) at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1010) at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1538) at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1391) at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:772) at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:666) at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:85) at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101) at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32) at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70) at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:437) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:377) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang. Thread .run( Thread .java:745) I also had one instance of a "timeout exceeded" error: [2016-02-18 11:09:55,729] [SSH git-upload-pack '/gerrit' (dougk)] ERROR com.google.gerrit.sshd.BaseCommand : Internal server error (user dougk account 1000000) during git-upload-pack '/gerrit' java.net.SocketTimeoutException: waitForCondition(Window[server/remote](ChannelSession[id=0, recipient=0]-GerritServerSession[dougk@ null ])) timeout exceeded: 30000 at org.apache.sshd.common.channel.Window.waitForCondition(Window.java:314) at org.apache.sshd.common.channel.Window.waitForSpace(Window.java:264) at org.apache.sshd.common.channel.ChannelOutputStream.flush(ChannelOutputStream.java:165) at org.apache.sshd.common.channel.ChannelOutputStream.write(ChannelOutputStream.java:119) at org.eclipse.jgit.transport.SideBandOutputStream.writeBuffer(SideBandOutputStream.java:171) at org.eclipse.jgit.transport.SideBandOutputStream.write(SideBandOutputStream.java:151) at org.eclipse.jgit.internal.storage.pack.PackOutputStream.write(PackOutputStream.java:126) at org.eclipse.jgit.internal.storage.file.ByteArrayWindow.write(ByteArrayWindow.java:89) at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:248) at org.eclipse.jgit.internal.storage.file.PackFile.copyPackAsIs(PackFile.java:355) at org.eclipse.jgit.internal.storage.file.LocalCachedPack.copyAsIs(LocalCachedPack.java:85) at org.eclipse.jgit.internal.storage.file.WindowCursor.copyPackAsIs(WindowCursor.java:236) at org.eclipse.jgit.internal.storage.pack.PackWriter.writePack(PackWriter.java:1010) at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1538) at org.eclipse.jgit.transport.UploadPack.sendPack(UploadPack.java:1391) at org.eclipse.jgit.transport.UploadPack.service(UploadPack.java:772) at org.eclipse.jgit.transport.UploadPack.upload(UploadPack.java:666) at com.google.gerrit.sshd.commands.Upload.runImpl(Upload.java:85) at com.google.gerrit.sshd.AbstractGitCommand.service(AbstractGitCommand.java:101) at com.google.gerrit.sshd.AbstractGitCommand.access$000(AbstractGitCommand.java:32) at com.google.gerrit.sshd.AbstractGitCommand$1.run(AbstractGitCommand.java:70) at com.google.gerrit.sshd.BaseCommand$TaskThunk.run(BaseCommand.java:437) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at com.google.gerrit.server.git.WorkQueue$Task.run(WorkQueue.java:377) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang. Thread .run( Thread .java:745) The hostname ("dougk@null") looks oddly suspicious there... But, the revert you made seems to have at least fixed the IllegalThreadStateException.
          Hide
          lgoldstein Goldstein Lyor added a comment -

          The null is indeed suspicios - however, it seems to be a MINA issue - if you look at AbstractSession#toString() you will see that the string that follows the '@' is the IoSession#getRemoteAddress() value - which in this case seems to have been null-ified.

          Show
          lgoldstein Goldstein Lyor added a comment - The null is indeed suspicios - however, it seems to be a MINA issue - if you look at AbstractSession#toString() you will see that the string that follows the '@' is the IoSession#getRemoteAddress() value - which in this case seems to have been null-ified.
          Hide
          lgoldstein Goldstein Lyor added a comment -

          I see that the code is using a GerritServerSession instead of the default ServerSessionImpl - can you send me the link to the code for it so I can look at it and see if there are any suspicious things ?

          Show
          lgoldstein Goldstein Lyor added a comment - I see that the code is using a GerritServerSession instead of the default ServerSessionImpl - can you send me the link to the code for it so I can look at it and see if there are any suspicious things ?
          Hide
          dougk Doug Kelly added a comment -

          Ah yes, you're absolutely right. Sorry I missed your message today. Anyway, the code (updated to the latest snapshot of mina-sshd's master) is at https://gerrit-review.googlesource.com/75030. You should be able to clone the unmodified source with just git clone https://gerrit.googlesource.com/gerrit, or browse it online at the link. A direct link to the GerritServerSession is here.

          Show
          dougk Doug Kelly added a comment - Ah yes, you're absolutely right. Sorry I missed your message today. Anyway, the code (updated to the latest snapshot of mina-sshd's master) is at https://gerrit-review.googlesource.com/75030 . You should be able to clone the unmodified source with just git clone https://gerrit.googlesource.com/gerrit , or browse it online at the link. A direct link to the GerritServerSession is here .
          Hide
          lgoldstein Goldstein Lyor added a comment -

          Thanks for the information - it has inspired SSHD-652 (which I will commmit shortly) - so you won't have to use your own GerritServerSession (or others) just for registering the listener for the CloseFuture

          Show
          lgoldstein Goldstein Lyor added a comment - Thanks for the information - it has inspired SSHD-652 (which I will commmit shortly) - so you won't have to use your own GerritServerSession (or others) just for registering the listener for the CloseFuture
          Hide
          dougk Doug Kelly added a comment -

          No, thank you for the continued help!

          I'll also point out that we made use of the FileKeyPairProvider, and when it went away in 1.0.0, we just ported it over directly, here: https://gerrit-review.googlesource.com/73033

          Show
          dougk Doug Kelly added a comment - No, thank you for the continued help! I'll also point out that we made use of the FileKeyPairProvider, and when it went away in 1.0.0, we just ported it over directly, here: https://gerrit-review.googlesource.com/73033
          Hide
          lgoldstein Goldstein Lyor added a comment -

          Actually, the class you mention did not really go away - it was slightly modified and moved to SecurityUtils#createFileKeyPairProvider(). The method returns an AbstractFileKeyPairProvider with a setFiles(Collection<Path>) method. The behavior is exactly the same as FileKeyPairProvider - you still have to check though that SecurityUtils#isBouncyCastleRegistered() before using it.

          Show
          lgoldstein Goldstein Lyor added a comment - Actually, the class you mention did not really go away - it was slightly modified and moved to SecurityUtils#createFileKeyPairProvider() . The method returns an AbstractFileKeyPairProvider with a setFiles(Collection<Path>) method. The behavior is exactly the same as FileKeyPairProvider - you still have to check though that SecurityUtils#isBouncyCastleRegistered() before using it.
          Hide
          dougk Doug Kelly added a comment -

          I wondered about that. Seemed awfully strange for it to disappear; I'll probably switch us over to using that, then! Thanks for the tip!

          Show
          dougk Doug Kelly added a comment - I wondered about that. Seemed awfully strange for it to disappear; I'll probably switch us over to using that, then! Thanks for the tip!
          Hide
          lgoldstein Goldstein Lyor added a comment -

          In general I try not to make useful things disappear - I may move them (I try to keep such movements to a minimum after 1.0, but from time to time it happens). Therefore, if anything seems to "disappear" I recommend posting a question on the dev@mina.apache.org mailing list before porting or re-implementing it. Also, as a rule of thumb I would recommend trying to compile Gerrit from time to time with the latest version from Github. This would help detect such "disappearances" earlier and/or expose bugs / missing features / etc...

          Show
          lgoldstein Goldstein Lyor added a comment - In general I try not to make useful things disappear - I may move them (I try to keep such movements to a minimum after 1.0, but from time to time it happens). Therefore, if anything seems to "disappear" I recommend posting a question on the dev@mina.apache.org mailing list before porting or re-implementing it. Also, as a rule of thumb I would recommend trying to compile Gerrit from time to time with the latest version from Github . This would help detect such "disappearances" earlier and/or expose bugs / missing features / etc...
          Hide
          cwicklow Carl Wicklow added a comment -

          In case it helps we see similar behavior in our application. Our IoHandler may not always see a sessionClosed (using Mina 2.0.0.13).

          My analysis suggests that the destroy(session) here ....

          AbstractPollingIoProcessor.writeBuffer()
                     try {
                          localWrittenBytes = write(session, buf, length);
                      } catch (IOException ioe) {
                          // We have had an issue while trying to send data to the
                          // peer : let's close the session.
                          buf.free();
                          session.close(true);
                          destroy(session);  // HERE
                          return 0;
                      }
          

          ... means that by the time the scheduled remove actually occurs, the session state is CLOSING (because the destroy(session) invalidates the selection key).

          Most often, we see a "Connection Reset By Peer" during the read that occurs before the write, and can deal with it in our exception handler.

          We'll sometimes see an inputClosed() (TCP half-close) - but neither closeNow() or closeOnFlush() will trigger a sessionClosed in the event that the write (which happens after the read) catches an exception (see also DIRMINA-1025, where the write may occur after a read that triggers an inputClosed()) - but, again, we have a chance here to handle it in our IoHandler.

          But when an exception is first caught during the writeBuffer() call, we get no sessionClosed(), the closeFuture will never complete, and any pending writeFutures are left out in the cold too.

          Show
          cwicklow Carl Wicklow added a comment - In case it helps we see similar behavior in our application. Our IoHandler may not always see a sessionClosed (using Mina 2.0.0.13). My analysis suggests that the destroy(session) here .... AbstractPollingIoProcessor.writeBuffer() try { localWrittenBytes = write(session, buf, length); } catch (IOException ioe) { // We have had an issue while trying to send data to the // peer : let's close the session. buf.free(); session.close( true ); destroy(session); // HERE return 0; } ... means that by the time the scheduled remove actually occurs, the session state is CLOSING (because the destroy(session) invalidates the selection key). Most often, we see a "Connection Reset By Peer" during the read that occurs before the write, and can deal with it in our exception handler. We'll sometimes see an inputClosed() (TCP half-close) - but neither closeNow() or closeOnFlush() will trigger a sessionClosed in the event that the write (which happens after the read) catches an exception (see also DIRMINA-1025 , where the write may occur after a read that triggers an inputClosed()) - but, again, we have a chance here to handle it in our IoHandler. But when an exception is first caught during the writeBuffer() call, we get no sessionClosed(), the closeFuture will never complete, and any pending writeFutures are left out in the cold too.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Thanks a bunch for the very clear description !

          The thing is that you may initiate a session close when at the same time the remote peer does also close the session. The exception you get will not terminate the future, and that is a problem. Now, you can probably handle the session cleanup in the exceptionCaught() method in the IoHandler. Have you checked that ?

          Show
          elecharny Emmanuel Lecharny added a comment - Thanks a bunch for the very clear description ! The thing is that you may initiate a session close when at the same time the remote peer does also close the session. The exception you get will not terminate the future, and that is a problem. Now, you can probably handle the session cleanup in the exceptionCaught() method in the IoHandler. Have you checked that ?
          Hide
          cwicklow Carl Wicklow added a comment -

          I don't see a code path that gets that IoException (e.g. "Broken Pipe" caught in the writeBuffer() method) to our IoHandler. Can you point me in the right direction ?

          Show
          cwicklow Carl Wicklow added a comment - I don't see a code path that gets that IoException (e.g. "Broken Pipe" caught in the writeBuffer() method) to our IoHandler. Can you point me in the right direction ?
          Hide
          elecharny Emmanuel Lecharny added a comment -

          If you get an IOException in the writeBuffer method, it will return 0. The AbstractPollingIoProcessor.flushNow() method will then set the selectorKey to be interested in Write and return. I wonder if the setInterestedInWrite() method might not throw an exception if the selectionKey has been cancelled in the mean time...

          Show
          elecharny Emmanuel Lecharny added a comment - If you get an IOException in the writeBuffer method, it will return 0. The AbstractPollingIoProcessor.flushNow() method will then set the selectorKey to be interested in Write and return. I wonder if the setInterestedInWrite() method might not throw an exception if the selectionKey has been cancelled in the mean time...
          Hide
          cwicklow Carl Wicklow added a comment -

          Thanks, I see what you mean, but no exception is thrown. As I see it, the destroy(session) cancelled the key before writeBuffer returned 0, so this ...

          NioProcessor.setInterestedInWrite()
                     SelectionKey key = session.getSelectionKey();
          
                  if ((key == null) || !key.isValid()) {
                      return;
                  }
          
          

          ... throws no exception. AFACIT, our IoHandler is not notified at all, but I'll do some more digging later today.

          Show
          cwicklow Carl Wicklow added a comment - Thanks, I see what you mean, but no exception is thrown. As I see it, the destroy(session) cancelled the key before writeBuffer returned 0, so this ... NioProcessor.setInterestedInWrite() SelectionKey key = session.getSelectionKey(); if ((key == null ) || !key.isValid()) { return ; } ... throws no exception. AFACIT, our IoHandler is not notified at all, but I'll do some more digging later today.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          From the code, either you get an IOException when trying to write the buffer, this exception will be caught, and processed so that the session will be closed and destroyed. In this case, you should get a sessionClosed event propagated to the IoHandler.

          If this is not what you get, then either :

          • the buffer is correctly written. Not the scenario you are interested in
          • you've got another exception than IOException (ie NonWritableChannelException, all the other are extending IOException)

          This is an unchecked exception. The best would be to add some log for this specific case, to see if you enter into this use case.

          Show
          elecharny Emmanuel Lecharny added a comment - From the code, either you get an IOException when trying to write the buffer, this exception will be caught, and processed so that the session will be closed and destroyed. In this case, you should get a sessionClosed event propagated to the IoHandler . If this is not what you get, then either : the buffer is correctly written. Not the scenario you are interested in you've got another exception than IOException (ie NonWritableChannelException , all the other are extending IOException ) This is an unchecked exception. The best would be to add some log for this specific case, to see if you enter into this use case.
          Hide
          cwicklow Carl Wicklow added a comment - - edited

          Thanks Emmanuel.

          I wasn't clear when I said "no exception is thrown". I meant that only in the context of what happens in setInterestedInWrite().

          In terms of the writeBuffer() call, an exception certainly is caught ....

          AbstractPollingIoProcessor.writeBuffer()
                     try {
                          localWrittenBytes = write(session, buf, length);
                      } catch (IOException ioe) {
                          // IOException("Broken Pipe") is caught ...
                          buf.free();
                          session.close(true);
                          destroy(session); 
                          return 0;
                      }
          

          The session.close(true) passes up the filter chain (as you showed in an earlier post), and the session gets scheduled for removal ...

          AbstractPollingIoProcessor.remove()
              public final void remove(S session) {
                  scheduleRemove(session);
                  startupProcessor();
              }
          
              private void scheduleRemove(S session) {
                  removingSessions.add(session);
              }
          

          The destroy(session) called from writeBuffer() then cancels the selection key for this session (session state is now CLOSING).

          NioProcessor
              @Override
              protected void destroy(NioSession session) throws Exception {
                  ByteChannel ch = session.getChannel();
                  SelectionKey key = session.getSelectionKey();
                  if (key != null) {
                      key.cancel();
                  }
                  ch.close();
              }
          

          The writeBuffer() returns 0, the flushNow returns false from here ...

          AbstractPollingIoProcessor.flushNow()
                          if (localWrittenBytes == 0) {
                              setInterestedInWrite(session, true);
                              return false;
                          }
          

          Which pops us back into the processor run loop ...

          AbstractPollingIoProcessor.Processor.run()
                             // Write the pending requests
                              long currentTime = System.currentTimeMillis();
                              flush(currentTime);
          
                              // And manage removed sessions
                              nSessions -= removeSessions();
          

          ... ready to remove the sessions. Our session is CLOSING now, though, so the removal is not propagated ...

          AbstractPollingIoProcessor.flushNow()
                          case CLOSING:
                              // Skip if channel is already closed
                              // In any case, remove the session from the queue
                              removedSessions++;
                              break;
          
          

          ... and that means the CloseFuture never completes, nor does the WriteFuture for the write that threw the exception, and no exception has been propogated to the IoHandler.

          Am still reviewing the rest (there's a lot to go through !), but I think this may be a path thru the code where the IoHandler doesn't get told about the exception.

          Show
          cwicklow Carl Wicklow added a comment - - edited Thanks Emmanuel. I wasn't clear when I said "no exception is thrown". I meant that only in the context of what happens in setInterestedInWrite(). In terms of the writeBuffer() call, an exception certainly is caught .... AbstractPollingIoProcessor.writeBuffer() try { localWrittenBytes = write(session, buf, length); } catch (IOException ioe) { // IOException( "Broken Pipe" ) is caught ... buf.free(); session.close( true ); destroy(session); return 0; } The session.close(true) passes up the filter chain (as you showed in an earlier post), and the session gets scheduled for removal ... AbstractPollingIoProcessor.remove() public final void remove(S session) { scheduleRemove(session); startupProcessor(); } private void scheduleRemove(S session) { removingSessions.add(session); } The destroy(session) called from writeBuffer() then cancels the selection key for this session (session state is now CLOSING). NioProcessor @Override protected void destroy(NioSession session) throws Exception { ByteChannel ch = session.getChannel(); SelectionKey key = session.getSelectionKey(); if (key != null ) { key.cancel(); } ch.close(); } The writeBuffer() returns 0, the flushNow returns false from here ... AbstractPollingIoProcessor.flushNow() if (localWrittenBytes == 0) { setInterestedInWrite(session, true ); return false ; } Which pops us back into the processor run loop ... AbstractPollingIoProcessor.Processor.run() // Write the pending requests long currentTime = System .currentTimeMillis(); flush(currentTime); // And manage removed sessions nSessions -= removeSessions(); ... ready to remove the sessions. Our session is CLOSING now, though, so the removal is not propagated ... AbstractPollingIoProcessor.flushNow() case CLOSING: // Skip if channel is already closed // In any case , remove the session from the queue removedSessions++; break ; ... and that means the CloseFuture never completes, nor does the WriteFuture for the write that threw the exception, and no exception has been propogated to the IoHandler. Am still reviewing the rest (there's a lot to go through !), but I think this may be a path thru the code where the IoHandler doesn't get told about the exception.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          FTR, I'm conducting a full review on the session close. I already completed the use case where we explicitely call NioSession.closeNow(), and it's pretty complex. I have found at least one thing that need to be fixed : we call the closeNow method a second time at the very end, which is totally useless.

          Otherwise, the real problem with this method is that the session is not actually closed now, but later on. Let me explain...

          In order to really close the session, a few things need to be done :

          • change the sessions state to 'closing'
          • empty the message waiting to be written queue
          • cancel the associated selectionKey
          • close the associated Channel

          The two last steps are done at the very end of the session's closing. IMO, that should be the first thing to do, in order to avoid any incoming message to be processed, because what happens is that when the first steps are completed, the current thread has just pushed the session into a queue of session to be removed, and this will be done only when we will loop on the select() method again, and we may have to process incoming messages for this session, which is utter nonsensical.

          We can't change that at the moment, because the NioSession.getSelectiorKey and NioSession.getChannel methods are package protected. But I do think that the real issue is that there is no method in the IoSession that can be called from closeNow method to disconnect the session. I'm going to add such a call.

          I still have to analyze the closeOnFlush() method, then there are other use cases : typically when the connection is closed properly on the client side, or when we get an exception on a session. That will be the next things I'll review.

          Show
          elecharny Emmanuel Lecharny added a comment - FTR, I'm conducting a full review on the session close. I already completed the use case where we explicitely call NioSession.closeNow() , and it's pretty complex. I have found at least one thing that need to be fixed : we call the closeNow method a second time at the very end, which is totally useless. Otherwise, the real problem with this method is that the session is not actually closed now , but later on. Let me explain... In order to really close the session, a few things need to be done : change the sessions state to 'closing' empty the message waiting to be written queue cancel the associated selectionKey close the associated Channel The two last steps are done at the very end of the session's closing. IMO, that should be the first thing to do, in order to avoid any incoming message to be processed, because what happens is that when the first steps are completed, the current thread has just pushed the session into a queue of session to be removed, and this will be done only when we will loop on the select() method again, and we may have to process incoming messages for this session, which is utter nonsensical. We can't change that at the moment, because the NioSession.getSelectiorKey and NioSession.getChannel methods are package protected. But I do think that the real issue is that there is no method in the IoSession that can be called from closeNow method to disconnect the session. I'm going to add such a call. I still have to analyze the closeOnFlush() method, then there are other use cases : typically when the connection is closed properly on the client side, or when we get an exception on a session. That will be the next things I'll review.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          I have slightly modified the way the session closure is done, by immediately destroying the channel and the selectionKey instead of waiting. That may help solving the issue we are having here.

          if someone is ready to give this a kick, I can provide a package.

          Just let me know.

          Show
          elecharny Emmanuel Lecharny added a comment - I have slightly modified the way the session closure is done, by immediately destroying the channel and the selectionKey instead of waiting. That may help solving the issue we are having here. if someone is ready to give this a kick, I can provide a package. Just let me know.
          Hide
          cwicklow Carl Wicklow added a comment -

          Happy to give it a try Emmanuel (email is in my profile, if that's the best way to send the package).

          Show
          cwicklow Carl Wicklow added a comment - Happy to give it a try Emmanuel (email is in my profile, if that's the best way to send the package).
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Carl,

          I just attached the latest package (mina-core) to the ticket.

          feel free to test it !

          Show
          elecharny Emmanuel Lecharny added a comment - Carl, I just attached the latest package (mina-core) to the ticket. feel free to test it !
          Hide
          cwicklow Carl Wicklow added a comment -

          This doesn't improve things from our point of view, i.e. we still have occasions where the session gets closed, but that closure is not propagated to the IoHandler, but perhaps that's not what you are attempting to fix.

          Show
          cwicklow Carl Wicklow added a comment - This doesn't improve things from our point of view, i.e. we still have occasions where the session gets closed, but that closure is not propagated to the IoHandler, but perhaps that's not what you are attempting to fix.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Actually, I was reviewing the various close methods in IoSession. I found that when teh closeNow() method was called, the underlying connection was not closed immediately, something that would potentially have some side effect, as some messages may be processed between the moment the closeNow() method was called, and the moment the effective closure is done.

          But this is not the bottom of my review : I still have to check what's going on when an exception occurs somewhere, because it ends with a close being done, and I think this is where you have some issue.

          I hope to get to the bottom of this problem soon, and to cut a 2.0.14 release, so bear with me in teh coming days...

          Show
          elecharny Emmanuel Lecharny added a comment - Actually, I was reviewing the various close methods in IoSession. I found that when teh closeNow() method was called, the underlying connection was not closed immediately, something that would potentially have some side effect, as some messages may be processed between the moment the closeNow() method was called, and the moment the effective closure is done. But this is not the bottom of my review : I still have to check what's going on when an exception occurs somewhere, because it ends with a close being done, and I think this is where you have some issue. I hope to get to the bottom of this problem soon, and to cut a 2.0.14 release, so bear with me in teh coming days...
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Ok, I think I see a scenario where the CloseFuture actually never get terminated.

          The writeBuffer call stack is like :

          Processor.run
            flush (OPENED)
              flushNow
                writeBuffer
          

          If one exception occurs in the writeBuffer, we do try to close the session :

                      } catch (IOException ioe) {
                          // We have had an issue while trying to send data to the
                          // peer : let's close the session.
                          buf.free();
                          session.closeNow();
                          destroy(session);
          
                          return 0;
                      }
          

          At this point, the CloseFuture that is returned by the session.closeNow() call is simply not kept anywhere (which is not necessarily an issue, because we intend to close the session immediately), and it will be terminated by a call to the fireSessionClosed() method, itself called by the fireSessionDestroyed() method, called by the removeNow() method, called by the removeSessions() method, ultimately called by the main Processor.run() method. It requires the session to have been added to the removingSessions queue, which should be the case as we called the closeNow() method. However, and you pointed it out before, the session might be in a CLOSING state, and in this case, we never call the removeNow() method, which call the fireSessionDestroyed() method...

          At this poiint, I think that modifying the writeBuffer method to do :

                      ...
                      try {
                          localWrittenBytes = write(session, buf, length);
                      } catch (IOException ioe) {
                          // We have had an issue while trying to send data to the
                          // peer : let's close the session.
                          buf.free();
                          session.closeNow();
                          removeNow(session);
          
                          return 0;
                      }
                  ...
          

          should pretty much cover this corner case. I have attached the latest build containing this fix, would you like to test it ?

          Show
          elecharny Emmanuel Lecharny added a comment - Ok, I think I see a scenario where the CloseFuture actually never get terminated. The writeBuffer call stack is like : Processor.run flush (OPENED) flushNow writeBuffer If one exception occurs in the writeBuffer , we do try to close the session : } catch (IOException ioe) { // We have had an issue while trying to send data to the // peer : let's close the session. buf.free(); session.closeNow(); destroy(session); return 0; } At this point, the CloseFuture that is returned by the session.closeNow() call is simply not kept anywhere (which is not necessarily an issue, because we intend to close the session immediately), and it will be terminated by a call to the fireSessionClosed() method, itself called by the fireSessionDestroyed() method, called by the removeNow() method, called by the removeSessions() method, ultimately called by the main Processor.run() method. It requires the session to have been added to the removingSessions queue, which should be the case as we called the closeNow() method. However, and you pointed it out before, the session might be in a CLOSING state, and in this case, we never call the removeNow() method, which call the fireSessionDestroyed() method... At this poiint, I think that modifying the writeBuffer method to do : ... try { localWrittenBytes = write(session, buf, length); } catch (IOException ioe) { // We have had an issue while trying to send data to the // peer : let's close the session. buf.free(); session.closeNow(); removeNow(session); return 0; } ... should pretty much cover this corner case. I have attached the latest build containing this fix, would you like to test it ?
          Hide
          cwicklow Carl Wicklow added a comment -

          Thanks, I'll give it a try ...

          Show
          cwicklow Carl Wicklow added a comment - Thanks, I'll give it a try ...
          Hide
          cwicklow Carl Wicklow added a comment -

          Thanks Emmanuel, that fix works for us !

          Show
          cwicklow Carl Wicklow added a comment - Thanks Emmanuel, that fix works for us !
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Great !!! It took me a f*g lot of time to figure out what was going on... This review was not a waste of time though.

          I also have to say that you helped a lot by providing not only an accurate report of the problem (and I guess you also spent a lot of time debugging MINA), but also by testing the patched versions

          I will cut a new MINA release soon.

          Show
          elecharny Emmanuel Lecharny added a comment - Great !!! It took me a f*g lot of time to figure out what was going on... This review was not a waste of time though. I also have to say that you helped a lot by providing not only an accurate report of the problem (and I guess you also spent a lot of time debugging MINA), but also by testing the patched versions I will cut a new MINA release soon.
          Hide
          elecharny Emmanuel Lecharny added a comment -

          Should have been fixed with commit 3b6c44642bdd07b2fb6b40015b62d7c6b5450dea

          Show
          elecharny Emmanuel Lecharny added a comment - Should have been fixed with commit 3b6c44642bdd07b2fb6b40015b62d7c6b5450dea

            People

            • Assignee:
              Unassigned
              Reporter:
              dougk Doug Kelly
            • Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development