Uploaded image for project: 'MINA'
  1. MINA
  2. DIRMINA-1019

SslHandler flushScheduledEvents race condition

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.9
    • Fix Version/s: 2.0.13
    • Component/s: SSL
    • Labels:
      None

      Description

      From what I've seen, the SslFilter class schedules events onto the SslHandler and then flushes them via the SslHandler.flushScheduledEvents() method.

      Within SslHandler.flushScheduledEvents() the lock can only be accumulated by a single thread, and all other threads that try to accumulate the lock instead follow through and not block waiting to get the lock.

      ...
      if (sslLock.tryLock()) {
      ...

      This leads to the race condition where a thread may call SslHandler.scheduleFilterWrite() followed by SslHandler.flushScheduledEvents() while another thread holds the sslLock and has finished dequeuing the SslHandler.filterWriteEventQueue and is currently dequeuing the SslHandler.messageReceivedEventQueue.

      I've actually hit this race condition quite a few times today and have added in a small fix which essentially tracks the number of times SslHandler.flushScheduledEvents() has been called. There's probably a much better solution but yeah just letting you guys know what I've found.

      Thanks,
      Terence

      1. SslFilterTest.java
        4 kB
        Terence Marks
      2. fix.java
        1 kB
        Terence Marks

        Activity

        Hide
        cwicklow Carl Wicklow added a comment - - edited

        We've just updated to Mina 2.0.13 (prompted by SSL re-ordering data, as fixed by https://issues.apache.org/jira/browse/DIRMINA-779).

        We're now bumping into an issue that looks like it's due to this (DIRMINA-1019) - our application can (rarely) stall - I see the data come off the network, it goes up the filter chain into the SslFilter, but never goes up further.

        I have a workaround (we send a ping down the session to flush things through), but thought I'd check if you folks thought it was an application issue or a Mina issue.

        Here's roughly what I see in SslHandler.flushScheduledEvents() :

        SslHandler.java
        	void flushScheduledEvents() {
        		
                scheduled_events.incrementAndGet();
        
                if (sslLock.tryLock()) {
                    
                    try {
                        do {
                        	IoFilterEvent event;
                            
                            // ... and so on ....
                        	
                        } while (scheduled_events.decrementAndGet() > 0);
                        
                        // Time passes ... 
                        
                    } finally {
                    	
                    	// ... and, maybe, another thread incremented the scheduled_event count
                    	
                        sslLock.unlock();
                    }
                }
                
                // If no more data is sent or received, we may leave here with unflushed events. 
            }
        

        Sorry, no test case or fix !

        Show
        cwicklow Carl Wicklow added a comment - - edited We've just updated to Mina 2.0.13 (prompted by SSL re-ordering data, as fixed by https://issues.apache.org/jira/browse/DIRMINA-779 ). We're now bumping into an issue that looks like it's due to this ( DIRMINA-1019 ) - our application can (rarely) stall - I see the data come off the network, it goes up the filter chain into the SslFilter, but never goes up further. I have a workaround (we send a ping down the session to flush things through), but thought I'd check if you folks thought it was an application issue or a Mina issue. Here's roughly what I see in SslHandler.flushScheduledEvents() : SslHandler.java void flushScheduledEvents() { scheduled_events.incrementAndGet(); if (sslLock.tryLock()) { try { do { IoFilterEvent event; // ... and so on .... } while (scheduled_events.decrementAndGet() > 0); // Time passes ... } finally { // ... and, maybe, another thread incremented the scheduled_event count sslLock.unlock(); } } // If no more data is sent or received, we may leave here with unflushed events. } Sorry, no test case or fix !
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Finally applying the original patch actually (how strange !!!) fixes the issue.

        Applied with http://git-wip-us.apache.org/repos/asf/mina/commit/1d437477

        Thanks Terence !

        Show
        elecharny Emmanuel Lecharny added a comment - Finally applying the original patch actually (how strange !!!) fixes the issue. Applied with http://git-wip-us.apache.org/repos/asf/mina/commit/1d437477 Thanks Terence !
        Hide
        elecharny Emmanuel Lecharny added a comment -

        I applied the patch partially, as stupid as it sounds. Blame me for that...

        Show
        elecharny Emmanuel Lecharny added a comment - I applied the patch partially, as stupid as it sounds. Blame me for that...
        Show
        elecharny Emmanuel Lecharny added a comment - Test added with http://git-wip-us.apache.org/repos/asf/mina/commit/763c2fb5
        Show
        elecharny Emmanuel Lecharny added a comment - Should be fixed with http://git-wip-us.apache.org/repos/asf/mina/commit/9b5c07f9
        Hide
        elecharny Emmanuel Lecharny added a comment -

        Actually, I'm not even sure why we protected the second while loop with the sslLock. I have moved this loop out of the protected section, and teh test is passing :

            /* no qualifier */void flushScheduledEvents() {
                // Fire events only when the lock is available for this handler.
                IoFilterEvent event;
                try {
                    sslLock.lock();
        
                    // We need synchronization here inevitably because filterWrite can be
                    // called simultaneously and cause 'bad record MAC' integrity error.
                    while ((event = filterWriteEventQueue.poll()) != null) {
                        NextFilter nextFilter = event.getNextFilter();
                        nextFilter.filterWrite(session, (WriteRequest) event.getParameter());
                    }
                } finally {
                    sslLock.unlock();
                }
        
                while ((event = messageReceivedEventQueue.poll()) != null) {
                    NextFilter nextFilter = event.getNextFilter();
                    nextFilter.messageReceived(session, event.getParameter());
                }
        
        Show
        elecharny Emmanuel Lecharny added a comment - Actually, I'm not even sure why we protected the second while loop with the sslLock. I have moved this loop out of the protected section, and teh test is passing : /* no qualifier */void flushScheduledEvents() { // Fire events only when the lock is available for this handler. IoFilterEvent event; try { sslLock.lock(); // We need synchronization here inevitably because filterWrite can be // called simultaneously and cause 'bad record MAC' integrity error. while ((event = filterWriteEventQueue.poll()) != null ) { NextFilter nextFilter = event.getNextFilter(); nextFilter.filterWrite(session, (WriteRequest) event.getParameter()); } } finally { sslLock.unlock(); } while ((event = messageReceivedEventQueue.poll()) != null ) { NextFilter nextFilter = event.getNextFilter(); nextFilter.messageReceived(session, event.getParameter()); }
        Hide
        tezm Terence Marks added a comment -

        Also attached my implementation of a fix. Really simple but does the trick. Probably abit overkill using an atomic integer.

        Show
        tezm Terence Marks added a comment - Also attached my implementation of a fix. Really simple but does the trick. Probably abit overkill using an atomic integer.
        Hide
        tezm Terence Marks added a comment -

        Here's a test demonstrating how the race condition can occur.

        Show
        tezm Terence Marks added a comment - Here's a test demonstrating how the race condition can occur.

          People

          • Assignee:
            Unassigned
            Reporter:
            tezm Terence Marks
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development