MINA / DIRMINA-1006

MINA 2.0.9 NioProcessor thread makes CPU usage hit 100%

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.9
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None
    • Environment:
      Linux version 2.6.32-358.el6.x86_64 (Red Hat 4.4.7-3)
      java version "1.7.0_67"
      Java(TM) SE Runtime Environment (build 1.7.0_67-b01)
      Java HotSpot(TM) 64-Bit Server VM (build 24.65-b04, mixed mode)

      Description

      Running as a client: after the exception (java.io.IOException: Connection reset by peer) appeared, the CPU went to 100%.

      thread dump:
      "NioProcessor-931" prio=10 tid=0x00007f3788004800 nid=0xd41 runnable [0x00007f394f4f3000]
      java.lang.Thread.State: RUNNABLE
      at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
      at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
      at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
      at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)

      • locked <0x00000007842118f0> (a sun.nio.ch.Util$2)
      • locked <0x00000007842118e0> (a java.util.Collections$UnmodifiableSet)
      • locked <0x00000007842114b0> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:97)
        at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1074)
        at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

      Locked ownable synchronizers:

      • <0x0000000784211210> (a java.util.concurrent.ThreadPoolExecutor$Worker)

      Is this the NIO epollWait bug?

      Attachments

      1. cpu100_01.png (59 kB) - briokyd
      2. cpu100_02.png (35 kB) - briokyd
      3. cpu100_03.png (31 kB) - briokyd
      4. DIRMINA-1006.patch (7 kB) - Radovan Semancik
      5. threadDump78658_20150211_151742.log (158 kB) - briokyd

        Activity

        briokyd added a comment -

        Here is the screenshot.

        Emmanuel Lecharny added a comment -

        Unlikely. We detect such a loop and kill the selector when it happens.

        Can you attach a profiler to the running Java process to see what it's doing?

        briokyd added a comment -

        Attached the thread dump log; same issue with "NioProcessor-939", nid=0xf1b.

        briokyd added a comment -

        Thanks for the response.
        Normally the NioProcessor thread exits after the exception, but sometimes it doesn't.
        I will try to install JProfiler on the server.

        Emmanuel Lecharny added a comment -

        AFAICT, you have 25 threads reading on a socket. The thread dump does not show much beyond that.

        briokyd added a comment -

        The session broke at 14:30 after the exception; the thread dump was taken at 15:17.
        I am confused as to why "NioProcessor-931" keeps reading the broken socket.

        Emmanuel Lecharny added a comment -

        If the disconnection was brutal, there is no way for the server to know that the socket has been closed.

        You must handle idle sessions in some way to determine whether a socket is dead.

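        For reference, a minimal sketch of one way to handle this with the MINA 2.0 API: configure a reader-idle timeout on the connector and close the session from sessionIdle(). The class name and the 60-second value below are illustrative assumptions, not taken from this issue.

            import org.apache.mina.core.service.IoHandlerAdapter;
            import org.apache.mina.core.session.IdleStatus;
            import org.apache.mina.core.session.IoSession;
            import org.apache.mina.transport.socket.nio.NioSocketConnector;

            // Illustrative client-side handler: if nothing has been read for 60 seconds,
            // assume the peer is gone and close the session.
            public class IdleAwareHandler extends IoHandlerAdapter {
                @Override
                public void sessionIdle(IoSession session, IdleStatus status) {
                    if (status == IdleStatus.READER_IDLE) {
                        session.close(true); // immediate close (MINA 2.0.x API)
                    }
                }

                public static NioSocketConnector newConnector() {
                    NioSocketConnector connector = new NioSocketConnector();
                    // Fire sessionIdle() after 60 seconds without a read.
                    connector.getSessionConfig().setIdleTime(IdleStatus.READER_IDLE, 60);
                    connector.setHandler(new IdleAwareHandler());
                    return connector;
                }
            }
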
        Radovan Semancik added a comment -

        I have more data on this.

        The thread is spinning in the AbstractPollingIoProcessor$Processor.run() loop. It seems to hit the "nasty epoll spinning" described in the source code comment, but wakeupCalled is set to true. Therefore the outer condition (line 1050) is false, the selector is not re-registered, and the loop spins forever.

        What exactly is the reason for the !wakeupCalled.get() in the branch condition? The subsequent code sets wakeupCalled to false, so it obviously expects it to be true. But that contradicts the !wakeupCalled.get(), doesn't it?

        Emmanuel Lecharny added a comment -

        The flag is set to true in NioProcessor:

            @Override
            protected void wakeup() {
                wakeupCalled.getAndSet(true);
                selector.wakeup();
            }
        

        Basically, it is set every time the wakeup() method is called, as in the flush() method.

        Why do we check the flag in the processor loop? Because we want to be sure that the select() call hasn't been explicitly interrupted by MINA itself (i.e., by a call to wakeup()).

        So if the select() is interrupted and wakeupCalled is true, it's expected. The question is rather: what is calling wakeup() over and over?

        Radovan Semancik added a comment -

        That's simple. Nothing. Nothing is calling wakeup() more than once. If I understand the code correctly, once wakeupCalled is set to true it is never reset to false. Both reset statements sit behind the !wakeupCalled.get() condition. So it spins forever. ... or ... did I get something wrong?

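        To make the described control flow concrete, here is a minimal standalone sketch (illustrative names, not the MINA source) of why a reset that sits behind a !wakeupCalled.get() check can never clear a flag that is already true:

            import java.util.concurrent.atomic.AtomicBoolean;

            // Models the loop structure described above, not the real
            // AbstractPollingIoProcessor code.
            public class StuckFlagDemo {
                public static void main(String[] args) {
                    AtomicBoolean wakeupCalled = new AtomicBoolean(true); // wakeup() already happened
                    int selected = 0; // select() returned no keys
                    long delta = 1;   // and returned immediately

                    for (int i = 0; i < 3; i++) {
                        if ((selected == 0) && !wakeupCalled.get() && (delta < 100)) {
                            // The reset (and the selector re-registration) live in here,
                            // so they are unreachable while the flag is true.
                            wakeupCalled.set(false);
                            System.out.println("would register a new selector");
                        } else {
                            System.out.println("iteration " + i + ": flag still " + wakeupCalled.get() + ", spinning");
                        }
                    }
                }
            }
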
        Emmanuel Lecharny added a comment -

        Dang...

        The wakeupCalled flag should be reset to false when it was the reason the select() was interrupted...

        And in this case, if we are hit by the nasty spinning bug, we don't take the chance to create a new selector because the flag is not false :/

        The code should probably be:

                            if (wakeupCalled.get()) {
                                wakeupCalled.getAndSet(false);
                                // We have been waked up, the processing will continue
                            } else if ((selected == 0) && (delta < 100)) {
                                // Last chance : the select() may have been
                                // interrupted because we have had a closed channel.
                                if (isBrokenConnection()) {
                                    LOG.warn("Broken connection");
                                } else {
                                    LOG.warn("Create a new selector. Selected is 0, delta = " + (t1 - t0));
                                    // Ok, we are hit by the nasty epoll spinning.
                                    // Basically, there is a race condition which causes a
                                    // closing file descriptor not to be considered as
                                    // available as a selected channel, but it stopped the
                                    // select. The next time we will call select(), it will
                                    // exit immediately for the same reason, and do so
                                    // forever, consuming 100% CPU.
                                    // We have to destroy the selector, and
                                    // register all the socket on a new one.
                                    registerNewSelector();
                                }
        
                                // and continue the loop
                                continue;
                            }
        
        Radovan Semancik added a comment -

        Hmmm ... nah. It does not work.

        It is a bit more complicated, and I was wrong. There is actually a wakeup() invocation further down (line 1133 of the modified source code). It gets invoked after wakeupCalled.getAndSet(false) and sets the flag back to true for the next iteration. So it spins again.

        Emmanuel Lecharny added a comment -

        Yeah, I realized after dinner that it was more complex than that. Here is another proposal:

                            if ((selected == 0) && !wakeupCalled.getAndSet(false) && (delta < 100)) {
                                // Last chance : the select() may have been
                                // interrupted because we have had a closed channel.
                                if (isBrokenConnection()) {
                                    LOG.warn("Broken connection");
                                } else {
                                    LOG.warn("Create a new selector. Selected is 0, delta = " + (t1 - t0));
                                    // Ok, we are hit by the nasty epoll spinning.
                                    // Basically, there is a race condition which causes a
                                    // closing file descriptor not to be considered as
                                    // available as a selected channel, but it stopped the
                                    // select. The next time we will call select(), it will
                                    // exit immediately for the same reason, and do so
                                    // forever, consuming 100% CPU.
                                    // We have to destroy the selector, and
                                    // register all the socket on a new one.
                                    registerNewSelector();
                                }
        
                                // and continue the loop
                                continue;
                            }
        

        In this case, if the flag was false (i.e. we weren't woken up) and we were hit by the spin bug, we will try again. Otherwise, if the flag was true, we have set it back to false immediately and can go on with the processing of flushes, etc.

        Can you give it a try?

        Emmanuel Lecharny added a comment -

        The scenario where you get 100% CPU is:

        • select() must return 0 (quite rare)
        • wakeupCalled must be 'true', i.e. the select() was woken up by an explicit call to the wakeup() method

        As we don't reset the flag in this case, we loop.

        There is another bug:

        In the original code, we don't reset the flag immediately, so it's possible to enter the if with the flag at false and have it flipped to true by a call to wakeup() just before we recreate a new selector. In that case, we will not process the remaining elements, and that's bad.

        Bottom line: when we check wakeupCalled, we also have to reset it in the same atomic operation.

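        A minimal standalone sketch (illustrative, not the MINA source) of the difference between a separate check-then-reset and an atomic check-and-reset with AtomicBoolean.getAndSet():

            import java.util.concurrent.atomic.AtomicBoolean;

            public class AtomicResetDemo {
                public static void main(String[] args) {
                    AtomicBoolean wakeupCalled = new AtomicBoolean(false);

                    // Racy pattern: another thread may call wakeupCalled.set(true)
                    // between the get() and the set(false), and that wakeup is lost.
                    if (!wakeupCalled.get()) {
                        // ... window for a concurrent wakeup() here ...
                        wakeupCalled.set(false);
                    }

                    // Atomic pattern: the read and the reset happen as one operation,
                    // so an earlier wakeup() is reported exactly once, and a later one
                    // stays set for the next check.
                    boolean wasWokenUp = wakeupCalled.getAndSet(false);
                    System.out.println("woken up explicitly: " + wasWokenUp);
                }
            }
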
        Radovan Semancik added a comment -

        I can try it, but I'm quite sure it won't work. The problem is the wakeup() call further down. That is another cause of the spin.

        Radovan Semancik added a comment -

        Is the wakeup() after if (isDisposing()) really necessary?

        How many threads can work on the selector? If it is just one thread, then we do not need a wakeup there, do we? We know that no select is currently waiting, because we are running in the very same thread.

        Emmanuel Lecharny added a comment - edited

        I changed the order in the test, so that the flag is flipped to false no matter what, and I removed the continue. The reason is that we need the flag flipped back to false if it was true, regardless of the potential spin bug.

        Yes, the call to wakeup() when we are disposing the selector is mandatory, because we need to process the sessions that have been put into the removingSessions queue, which is done by the call to the removeSessions() method higher up (line 1094).

        This piece of code is convoluted and needs some love.

        Here is the latest version:

                    for (;;) {
                        try {
                            // This select has a timeout so that we can manage
                            // idle session when we get out of the select every
                            // second. (note : this is a hack to avoid creating
                            // a dedicated thread).
                            long t0 = System.currentTimeMillis();
                            int selected = select(SELECT_TIMEOUT);
                            long t1 = System.currentTimeMillis();
                            long delta = (t1 - t0);
        
                            if (!wakeupCalled.getAndSet(false) && (selected == 0) && (delta < 100)) {
                                // Last chance : the select() may have been
                                // interrupted because we have had a closed channel.
                                if (isBrokenConnection()) {
                                    LOG.warn("Broken connection");
                                } else {
                                    LOG.warn("Create a new selector. Selected is 0, delta = " + (t1 - t0));
                                    // Ok, we are hit by the nasty epoll spinning.
                                    // Basically, there is a race condition which causes a
                                    // closing file descriptor not to be considered as
                                    // available as a selected channel, but it stopped the
                                    // select. The next time we will call select(), it will
                                    // exit immediately for the same reason, and do so
                                    // forever, consuming 100% CPU.
                                    // We have to destroy the selector, and
                                    // register all the socket on a new one.
                                    registerNewSelector();
                                }
                            }
        ...
        
        Emmanuel Lecharny added a comment -

        I have changed the removeSessions() method:

            private int removeSessions() {
                int removedSessions = 0;
        
                for (S session = removingSessions.poll(); session != null;session = removingSessions.poll()) {
                    SessionState state = getState(session);
        
                    // Now deal with the removal accordingly to the session's state
                    switch (state) {
                        case OPENED:
                            // Try to remove this session
                            if (removeNow(session)) {
                                removedSessions++;
                            }
                            
                            break;
            
                        case CLOSING:
                            // Skip if channel is already closed
                            // In any case, remove the session from the queue
                            removedSessions++;
                            break;
            
                        case OPENING:
                            // Remove session from the newSessions queue and
                            // remove it
                            newSessions.remove(session);
            
                            if (removeNow(session)) {
                                removedSessions++;
                            }
            
                            break;
            
                        default:
                            throw new IllegalStateException(String.valueOf(state));
                    }
                }
        
                return removedSessions;
            }
        

        This is more of a workaround to get the number of sessions down to 0 and let the processor loop exit. There is certainly more to do.

        Basically, we should analyze all the possible reasons why a session might be removed and, for each use case, determine what we should do. Here are the use cases that come to mind:

        • the application closes the session explicitly
        • the remote peer has closed the connection
        • the application has disposed the processor or the service
        • we have had an exception that forces the session's closure (and here, we have many possible causes)

        In most of these use cases the session can be removed immediately, but when the application decides to close a session, we need to take care of the pending messages: they have to be sent if the application hasn't requested an immediate close. This is the corner case we have to take care of, IMO.

        I do think we lack a state (we have OPENING, OPENED, and CLOSING, but no CLOSED state)...

        Radovan Semancik added a comment -

        Here's the patch.
        MINA tests passing, Directory API tests passing, midPoint LDAP connector tests passing, and all my manual test cases work fine.

        Emmanuel Lecharny added a comment -

        Thanks Radovan. I'll apply the missing parts.

        I hope we have taken care of all the use cases!

        Emmanuel Lecharny added a comment -

        Should be fixed with http://git-wip-us.apache.org/repos/asf/mina/commit/2d6f8256

          People

          • Assignee: Unassigned
          • Reporter: briokyd
          • Votes: 1
          • Watchers: 4