MINA
  1. MINA
  2. DIRMINA-708

Unbound thread growth on discovery attempts : Improve documentation

    Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-M5
    • Fix Version/s: 2.0.0
    • Component/s: None
    • Labels:
      None

      Description

      Using several different possible protocol handlers (handler, codecs, et al) on several potential ports (serial and/or network), our code attempts to discover devices on the far end and attach the proper decoders on the proper ports.

      Generally, this process is working, however, every failed discovery attempt has launched roughly 6 threads, but only destroyed roughly 4 of them. The 2 delta threads sit in WAIT states forever.

      More details are available in the "Unbound thread growth" thread of the listserv.

        Activity

        Hide
        boB Gage added a comment -

        JIP output from short execution.

        Four different discovery cycles appear in the run, at time deltas: 0:27, 0:55, 2:06, and 2:42.

        Manually halted before fourth discovery attempt timed out. ( Each protocol defines different time out rates, that last one was going to take several minutes )

        Show
        boB Gage added a comment - JIP output from short execution. Four different discovery cycles appear in the run, at time deltas: 0:27, 0:55, 2:06, and 2:42. Manually halted before fourth discovery attempt timed out. ( Each protocol defines different time out rates, that last one was going to take several minutes )
        Hide
        boB Gage added a comment -

        It has been over two weeks since I collected profile data and attached it to this ticket as requested.

        Is anything going to be done about this or should I be digging into the Mina code trying to fix it myself???

        Show
        boB Gage added a comment - It has been over two weeks since I collected profile data and attached it to this ticket as requested. Is anything going to be done about this or should I be digging into the Mina code trying to fix it myself???
        Hide
        Emmanuel Lecharny added a comment -

        Hi,

        I have looked at the traces, but as you can imagine, it's far from being easy to analyze what's going on. Since you attached it, a major bug has been fixed, which took me 3 full days (DIRMINA-712).

        I understand that it's a major issue for you, but you may also consider that as volunteers, we have a limited amount of time to work on the project.

        As a side effect, yes, sure, if you want to dig into the code, and find a fix, that would be greatly appreciated ! Anyone can help, this is basically how it works...

        Show
        Emmanuel Lecharny added a comment - Hi, I have looked at the traces, but as you can imagine, it's far from being easy to analyze what's going on. Since you attached it, a major bug has been fixed, which took me 3 full days ( DIRMINA-712 ). I understand that it's a major issue for you, but you may also consider that as volunteers, we have a limited amount of time to work on the project. As a side effect, yes, sure, if you want to dig into the code, and find a fix, that would be greatly appreciated ! Anyone can help, this is basically how it works...
        Hide
        Emmanuel Lecharny added a comment -

        A quick look at the code shows that if the thread is locked, it's because the flush() method is not called. Let's walk the code :

        We have a writeMonitor which is used in two places ; the WriteWorker and the SerialIoProcessor.

        • In the WriteWorker :

        ...
        // wait for more data
        synchronized (writeMonitor) {
        try

        { writeMonitor.wait(); }

        catch (InterruptedException e)

        { log.error("InterruptedException", e); }

        }
        ...

        • in the SerialIoProcessor :
          ...
          public void flush(SerialSessionImpl session) {
          if (writeWorker == null) { writeWorker = new WriteWorker(); writeWorker.start(); }

          else

          Unknown macro: { synchronized (writeMonitor) { writeMonitor.notifyAll(); } }

          }
          ...

        This flush method is called by the SerialIoProcessor.remove() method, and the flush method is the only place the WriterWorker is created.

        Now note the sequence :
        1) we start the SerialSessionImpl
        2) It creates everything, and the remove() method is called when there is an exception
        3) The remove() method do some cleaning, and call the flush() method (this is the only place this method is called)
        4) The flush method create a new WriteWorker, as this is the only place we create it in the code
        5) The WriteWorker is started, calls flushWrites()
        6) When this methods returns, the WriteMonitor wait for notification
        7) As soon as the SerialIoProcessor.remove() method will be called, the flush(à method will be called and it will notify the WriteMonitor, releasing the thread, if it were blocked on a wait().

        The remove() is invoked when the session is closed. I don't have the code of your application, so I can't tell if you close the session, and when. Also there is something strange I see in the traces : if you check the thread 30, you can see that the SerialSessionImpl.flushWrites() method is called, and end with a call to PdrProtocolEngine.messageSent() method, which does not return. I'm wondering if this method is not supposed to timeout at some point, and then close the session, if it can't write to the client.

        This is a bit difficult to say what'ss really going in in your app, as I don't see the code, so consider those elements as semi blind guesses ...

        Show
        Emmanuel Lecharny added a comment - A quick look at the code shows that if the thread is locked, it's because the flush() method is not called. Let's walk the code : We have a writeMonitor which is used in two places ; the WriteWorker and the SerialIoProcessor. In the WriteWorker : ... // wait for more data synchronized (writeMonitor) { try { writeMonitor.wait(); } catch (InterruptedException e) { log.error("InterruptedException", e); } } ... in the SerialIoProcessor : ... public void flush(SerialSessionImpl session) { if (writeWorker == null) { writeWorker = new WriteWorker(); writeWorker.start(); } else Unknown macro: { synchronized (writeMonitor) { writeMonitor.notifyAll(); } } } ... This flush method is called by the SerialIoProcessor.remove() method, and the flush method is the only place the WriterWorker is created. Now note the sequence : 1) we start the SerialSessionImpl 2) It creates everything, and the remove() method is called when there is an exception 3) The remove() method do some cleaning, and call the flush() method (this is the only place this method is called) 4) The flush method create a new WriteWorker, as this is the only place we create it in the code 5) The WriteWorker is started, calls flushWrites() 6) When this methods returns, the WriteMonitor wait for notification 7) As soon as the SerialIoProcessor.remove() method will be called, the flush(à method will be called and it will notify the WriteMonitor, releasing the thread, if it were blocked on a wait(). The remove() is invoked when the session is closed. I don't have the code of your application, so I can't tell if you close the session, and when. Also there is something strange I see in the traces : if you check the thread 30, you can see that the SerialSessionImpl.flushWrites() method is called, and end with a call to PdrProtocolEngine.messageSent() method, which does not return. I'm wondering if this method is not supposed to timeout at some point, and then close the session, if it can't write to the client. This is a bit difficult to say what'ss really going in in your app, as I don't see the code, so consider those elements as semi blind guesses ...
        Hide
        boB Gage added a comment -

        I'm unsure as to how anything could get stuck inside PdrProtocolEngine.messageSent(), the entire method is reproduced here:

        public abstract class PdrProtocolEngine extends IoHandlerAdapter {
        ...
        @Override
        public final void messageSent(IoSession session, Object message)

        { logger.debug("messageSent(session, " + message + ")"); }

        ...
        }

        Is there some step I've missed that I should be doing in the overridden method?? This particular one was a placeholder we never went anywhere with; I'll try simply removing it to see if the problem disappears.


        I do recall one of the changes we had to do when updating to M4 (I think we were at M2 before that?) – we had been using IoSession.closeOnFlush() which got removed so we had to drop back to a straight close() method.

        Since then our code has called close methods, but there is no flush anywhere that we use or that I am aware of.

        Matter of fact, a search for flush through our project uncovers only my notes from that upgrade and from chasing this bug – we have zero specific flush calls in our code.

        Is this a problem?? I assumed Mina will flush the sessions it is closing – or at least trigger whatever threads it may have launched to control that session. Is there something more I should do beyond calling the IoSesison.close() method to bring down a session that fails discovery??


        Digging now... more later

        Show
        boB Gage added a comment - I'm unsure as to how anything could get stuck inside PdrProtocolEngine.messageSent(), the entire method is reproduced here: public abstract class PdrProtocolEngine extends IoHandlerAdapter { ... @Override public final void messageSent(IoSession session, Object message) { logger.debug("messageSent(session, " + message + ")"); } ... } Is there some step I've missed that I should be doing in the overridden method?? This particular one was a placeholder we never went anywhere with; I'll try simply removing it to see if the problem disappears. I do recall one of the changes we had to do when updating to M4 (I think we were at M2 before that?) – we had been using IoSession.closeOnFlush() which got removed so we had to drop back to a straight close() method. Since then our code has called close methods, but there is no flush anywhere that we use or that I am aware of. Matter of fact, a search for flush through our project uncovers only my notes from that upgrade and from chasing this bug – we have zero specific flush calls in our code. Is this a problem?? I assumed Mina will flush the sessions it is closing – or at least trigger whatever threads it may have launched to control that session. Is there something more I should do beyond calling the IoSesison.close() method to bring down a session that fails discovery?? Digging now... more later
        Hide
        Emmanuel Lecharny added a comment -

        Seems that the app is not blocked in the PdrProtocolEngine.messageSent() method, but more likely on the WriterMonitor wait then. Which makes sense ...

        So I guess that this call to the remove() method is mandatory at some point, in order to get the flush() method to be called. This part of the code is not one I'm comfortable with, as I never used it. I guess that closing the session should either call the remove() method before or do it inside the close() handling at some point.

        Show
        Emmanuel Lecharny added a comment - Seems that the app is not blocked in the PdrProtocolEngine.messageSent() method, but more likely on the WriterMonitor wait then. Which makes sense ... So I guess that this call to the remove() method is mandatory at some point, in order to get the flush() method to be called. This part of the code is not one I'm comfortable with, as I never used it. I guess that closing the session should either call the remove() method before or do it inside the close() handling at some point.
        Hide
        boB Gage added a comment - - edited

        Uh oh... I think I fixed it.

        The documentation for IoSession tells us:

        CloseFuture close(boolean immediately)

        Closes this session immediately or after all queued write requests are flushed. This operation is asynchronous. Wait for the returned CloseFuture if you want to wait for the session actually closed.

        Parameters:
        immediately - true to close this session immediately (i.e. close()). false to close this session after all queued write requests are flushed (i.e. #closeOnFlush()).


        We want all outgoing data to be flushed, so we had:

        CloseFuture cf = session.close(false); // Mina M2->M4 change, was: session.closeOnFlush();
        cf.await(closeTimeout, TimeUnit.SECONDS);
        if ( cf.isClosed() )

        { ... }

        and observe all of the thread growth behaviour discussed here and in the listserv.

        Changing the false close() parameter to true eliminates the thread growth. [ Actually it limits it – the first cycle through all of the possible protocols has an upward tilt, but after that the thread graph's a nice flat line with bumps. ]

        Okay, I'll admit it – I didn't fix it, I just found a way to avoid it. 'sgood enough for me.

        Show
        boB Gage added a comment - - edited Uh oh... I think I fixed it. The documentation for IoSession tells us: CloseFuture close(boolean immediately) Closes this session immediately or after all queued write requests are flushed. This operation is asynchronous. Wait for the returned CloseFuture if you want to wait for the session actually closed. Parameters: immediately - true to close this session immediately (i.e. close()). false to close this session after all queued write requests are flushed (i.e. #closeOnFlush()). We want all outgoing data to be flushed, so we had: CloseFuture cf = session.close(false); // Mina M2->M4 change, was: session.closeOnFlush(); cf.await(closeTimeout, TimeUnit.SECONDS); if ( cf.isClosed() ) { ... } and observe all of the thread growth behaviour discussed here and in the listserv. Changing the false close() parameter to true eliminates the thread growth. [ Actually it limits it – the first cycle through all of the possible protocols has an upward tilt, but after that the thread graph's a nice flat line with bumps. ] Okay, I'll admit it – I didn't fix it, I just found a way to avoid it. 'sgood enough for me.
        Hide
        Emmanuel Lecharny added a comment -

        Great !

        I suggest then we lower the issue to 'Task' with a 'major' level, changing the title to 'Improve documentation on the session removal'.

        Show
        Emmanuel Lecharny added a comment - Great ! I suggest then we lower the issue to 'Task' with a 'major' level, changing the title to 'Improve documentation on the session removal'.
        Hide
        Emmanuel Lecharny added a comment -

        Affected to 2.0.0-RC1

        Show
        Emmanuel Lecharny added a comment - Affected to 2.0.0-RC1
        Hide
        Emmanuel Lecharny added a comment -

        moved to 2.0.0

        Show
        Emmanuel Lecharny added a comment - moved to 2.0.0
        Hide
        Emmanuel Lecharny added a comment -

        Added some info in the IoSession.close( immediately ) method's Javadoc.

        Show
        Emmanuel Lecharny added a comment - Added some info in the IoSession.close( immediately ) method's Javadoc.
        Hide
        Emmanuel Lecharny added a comment -

        Closing issues

        Show
        Emmanuel Lecharny added a comment - Closing issues

          People

          • Assignee:
            Unassigned
            Reporter:
            boB Gage
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development