MINA / DIRMINA-934

Replace synchronized with a Semaphore for better performance

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.7, 2.0.8
    • Fix Version/s: 2.0.8
    • Component/s: Core
    • Environment:
      Windows 8 Pro x64, JDK 7

      Description

      Replacing the synchronized block in the ProtocolCodecFilter with a Semaphore provides a significant benefit in terms of locking and also reduces CPU utilization. See the attached git diff.
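
      For orientation, here is a minimal, self-contained sketch of the idea behind the patch. SemaphoreGuardedFilter and decodeAndProcess are made-up names used only for illustration; the real change is in ProtocolCodecFilter.messageReceived(), and the attached diff is the actual patch.

      import java.util.concurrent.Semaphore;

      // Illustrative sketch only, not the actual patch.
      public class SemaphoreGuardedFilter {

          // one fair permit per filter instance, replacing synchronized (decoder) { ... }
          private final Semaphore semaphore = new Semaphore(1, true);

          public void messageReceived(Runnable decodeAndProcess) throws InterruptedException {
              semaphore.acquire();        // blocks other threads entering this filter instance
              try {
                  decodeAndProcess.run(); // decode the buffer and process the decoded messages
              } finally {
                  semaphore.release();    // released only once processing is done
              }
          }
      }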


          Activity

          mondain Paul Gregoire added a comment -

          Git diff against 2.0 repo

          elecharny Emmanuel Lecharny added a comment -

          Sounds good to me.

          I will apply the patch next week. Thanks !

          elecharny Emmanuel Lecharny added a comment -

          There is still one thing that needs to be reviewed: the lock is acquired before the decoding is done, and released after the message has been processed. This is not the case in the existing code, where we just protect the decoderOut instance while decoding, which means some other thread could perfectly well clear the queue while the current thread is expecting something to be present.

          I think there is a major flaw in the way we process the decoding, but we are most certainly protected by the fact that we don't often use an executor in the chain, and especially an unordered thread executor.

          The proposed fix not only saves some CPU by removing the synchronized section, but also protects the code from more severe race conditions.

          Btw, I think that we can get rid of the ConcurrentLinkedQueue in the AbstractProtocolDecoderOutput if we use this fix. I'll do that and run the test.

          I would really appreciate it if someone else reviewed the patch and its consequences.

          Many thanks !

          mondain Paul Gregoire added a comment -

          I'm all for improving Mina where I can; glad to help.

          mondain Paul Gregoire added a comment -

          Added change to AbstractProtocolDecoderOutput queue.

          mondain Paul Gregoire added a comment -

          For what it's worth, I've been testing with Red5, and with these patches applied it seems to work quite well.

          mondain Paul Gregoire added a comment -

          As an aside, how do I build the 2.x version using the git repo? The build instructions on the site are for the old svn repository.

          elecharny Emmanuel Lecharny added a comment -

          Still have to apply the changes... lack of time on my side !

          Building is simple. First, clone the repo:

          git clone http://git-wip-us.apache.org/repos/asf/mina.git mina

          then check out the 2.0 branch from inside the clone:

          git checkout 2.0

          FYI I just updated the web site (it will take a couple of hours for the mirrors to be updated)

          elecharny Emmanuel Lecharny added a comment -

          FTR, I have locally applied the patch, and tests are passing. I'll try to commit it asap.

          elecharny Emmanuel Lecharny added a comment -

          Patch applied back in March 2014.

          quhw Huanwen Qu added a comment -

          This is definitely an important change that should be described in the release notes. If we put the ProtocolCodecFilter after an ExecutorFilter, the ProtocolCodecFilter will block the concurrent execution of the worker threads because of the semaphore lock. There is no such problem in 2.0.7. We hit this issue in production and it cost us a lot of time to find the root cause. It's too bad.
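
          To make the scenario concrete, this is the kind of chain ordering being described. MyCodecFactory is a placeholder for an application codec factory, and the snippet only illustrates the ordering, it is not taken from this issue's patch. Every pool thread ends up queueing on the codec filter's single semaphore permit:

          // Hypothetical ordering: the pool fans work out to 16 threads, but each of them
          // must then acquire the ProtocolCodecFilter's single permit, serializing them again.
          acceptor.getFilterChain().addLast("threadPool", new ExecutorFilter(16));
          acceptor.getFilterChain().addLast("codec", new ProtocolCodecFilter(new MyCodecFactory()));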

          padakwaak Chris Kistner added a comment - edited

          Thanks Huanwen Qu!

          With MINA 2.0.7, we used to create the acceptor with 20 NIO processors and it worked pretty well:

          acceptor = new NioSocketAcceptor(NIO_ACCEPTOR_THREADS);
          acceptor.getFilterChain().addLast("codec", new ProtocolCodecFilter(new OurProtocolCodecFactory()));

          However, with MINA 2.0.8/2.0.9 the performance degraded considerably, with all 20 of our MINA threads in a waiting state:

          "NioProcessor-5" - Thread t@1091
             java.lang.Thread.State: WAITING
          	at sun.misc.Unsafe.park(Native Method)
          	- parking to wait for <1562a344> (a java.util.concurrent.Semaphore$FairSync)
          	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
          	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
          	at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
          	at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:231)
          	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:542)
          	at org.apache.mina.core.filterchain.DefaultIoFilterChain.access$1300(DefaultIoFilterChain.java:48)
          	at org.apache.mina.core.filterchain.DefaultIoFilterChain$EntryImpl$1.messageReceived(DefaultIoFilterChain.java:943)
          	at org.apache.mina.core.filterchain.IoFilterAdapter.messageReceived(IoFilterAdapter.java:109)
          	at org.apache.mina.core.filterchain.DefaultIoFilterChain.callNextMessageReceived(DefaultIoFilterChain.java:542)
          	at org.apache.mina.core.filterchain.DefaultIoFilterChain.fireMessageReceived(DefaultIoFilterChain.java:535)
          	at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:714)
          	at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:668)
          	at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:657)
          	at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
          	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1121)
          	at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
          	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
          	at java.lang.Thread.run(Thread.java:724)
          
             Locked ownable synchronizers:
          	- locked <72562fdf> (a java.util.concurrent.ThreadPoolExecutor$Worker)

          Now I've changed our code to add the thread pool to the filter chain, to get performance similar to what we had with MINA 2.0.7:

          minaWorkers = new OrderedThreadPoolExecutor(1, 20, 60L, TimeUnit.SECONDS,
          	new OurLablingDefaultThreadFactory("minaWorker"), null);
          
          acceptor = new NioSocketAcceptor(2);
          acceptor.getFilterChain().addLast("codec", new ProtocolCodecFilter(new OurProtocolCodecFactory()));
          acceptor.getFilterChain().addLast("minaWorkers", new ExecutorFilter(minaWorkers));

          Also take note that if you don't specify the processorCount argument for the NioSocketAcceptor and you have more than one CPU core/thread (for example a quad-core Xeon CPU with hyperthreading enabled), you may also run into this semaphore lock issue under high network load! We haven't experimented yet with other values for processorCount to see where the threshold/optimum performance is.

          I hope this helps other people struggling with the same issue.

          jeffmaury Jeff MAURY added a comment -

          What about:

          • removing the synchronization
          • removing the decoder output from the session and replacing it with a local decoder out (that is in all cases flushed, i.e. emptied, at the end of the decoding phase)? A rough sketch of this follows below.
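
          Purely to make that idea concrete, here is a small self-contained sketch. Downstream, LocalDecoder and LocalOutputCodecFilter are made-up types, not MINA API; the point is just that a fresh output per call removes the shared state that needed locking.

          import java.util.ArrayList;
          import java.util.List;

          // Made-up types, for illustration only (not MINA code, not the committed fix).
          interface Downstream { void messageReceived(Object decoded); }
          interface LocalDecoder { void decode(Object buffer, List<Object> out) throws Exception; }

          class LocalOutputCodecFilter {
              private final LocalDecoder decoder;

              LocalOutputCodecFilter(LocalDecoder decoder) {
                  this.decoder = decoder;
              }

              void messageReceived(Downstream next, Object buffer) throws Exception {
                  List<Object> localOut = new ArrayList<>(); // fresh output, never shared across threads
                  decoder.decode(buffer, localOut);          // decode this buffer only
                  for (Object decoded : localOut) {          // "flush" (empty) it at the end of decoding
                      next.messageReceived(decoded);
                  }
              }
          }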

          elecharny Emmanuel Lecharny added a comment -

          I never liked the way the decoder was implemented, from day one. It's overcomplicated.

          AFAICT, the idea was to decouple the decoding from the processing of incoming bytes. It was possible to inject a DecoderOut that uses an internal executor to send the decoded message while the decoding is executed. For instance, if a buffer contains more than one message, we could imagine that each decoded message is pushed to the DecoderOut, which spawns a thread (or grabs a free thread) to process this decoded message, while the decoder keeps going with the remaining bytes. It's just overkill.

          I would rather call the decoder, get the message, push it to the chain, assuming that the implementers could have added an executor filter later on in the chain, and then keep going with the decoding. In other words, we don't need this callback mechanism, which requires some synchronization. This is more or less what you are proposing.

          wdyt ?

          mondain Paul Gregoire added a comment -

          What we have now works well, but personally I'd love to see the sync stuff go away. What did you guys do in 3.x for decoding? Is it the same code? Eventually, I believe that IntStreams are going to replace a lot of our current routines.

          https://docs.oracle.com/javase/8/docs/api/java/util/stream/IntStream.html

          jeffmaury Jeff MAURY added a comment -

          Emmanuel,

          I think I got your point. The idea behind the decoderOut is that it is currently implemented as a queue of generated messages that are sent to the next filter upon flush. The problem is that, as it is bound to the session, we need a lock in case two ProtocolCodec filters get executed concurrently.
          My proposition is the following:

          • remove the synchronisation
          • replace getDecoderOut with an overridable method (it is private in 2.0 so it cannot be overridden), remove the session storage, and return a new object each time it is called.
          elecharny Emmanuel Lecharny added a comment -

          Actually, this was a bad idea...

          The problem with the semaphore is that it creates a lock that is global to an instance of the codec filter. If one doesn't create as many filters as there are sessions, this will create a bottleneck.

          The synchronized(decoder) is also a problem, though: it creates a bottleneck on the decoder, which might not be instantiated for each session, but shared. In that case, we get a serialization of the decoding across the whole server, which is even worse (and that was what this patch was trying to solve, AFAICT).

          The right solution would instead be to synchronize on the session, because there is no way that could create a bottleneck (a minimal sketch follows at the end of this comment):

          • each session will be able to use the decoder at will, in a serial way
          • a session won't lock another session during a decoding

          A side remark: on TCP, there is no reason to have an executor present before the decoder, because that would prevent the decoding of split messages. On UDP, there is no such problem, so we may later want to remove the synchronized section when we are on UDP.
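
          As promised above, a minimal self-contained sketch of that locking shape. PerSessionLockingFilter, sessionId and decodeAndFlush are made-up names; in MINA the monitor would presumably be the IoSession itself (or a per-session attribute). This is only an illustration, not the actual fix (see the commit referenced in the next comment):

          import java.util.concurrent.ConcurrentHashMap;
          import java.util.concurrent.ConcurrentMap;

          // Made-up sketch of "synchronize on the session": decoding stays serial within one
          // session but runs concurrently across sessions, so no global bottleneck.
          class PerSessionLockingFilter {
              // stand-in for a lock stored as a session attribute
              private final ConcurrentMap<Long, Object> locks = new ConcurrentHashMap<>();

              void messageReceived(long sessionId, Runnable decodeAndFlush) {
                  Object lock = locks.computeIfAbsent(sessionId, id -> new Object());
                  synchronized (lock) { // blocks only other threads working on the same session
                      decodeAndFlush.run();
                  }
              }
          }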

          elecharny Emmanuel Lecharny added a comment -

          Should be fixed with http://git-wip-us.apache.org/repos/asf/mina/commit/849e22af

            People

            • Assignee: Unassigned
            • Reporter: mondain Paul Gregoire
            • Votes: 0
            • Watchers: 5


            Time Tracking

            • Original Estimate: 2h
            • Remaining Estimate: 2h
            • Time Spent: Not Specified
