SOLR-6406: ConcurrentUpdateSolrServer hang in blockUntilFinished

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.5
    • Component/s: None
    • Labels: None

      Description

      Not sure what is causing this, but SOLR-6136 may have taken us a step back here. I see this problem occasionally pop up in ChaosMonkeyNothingIsSafeTest now - the test fails because of a thread leak, and the thread leak is due to a ConcurrentUpdateSolrServer hang in blockUntilFinished. It only started popping up recently.

      Attachments

      1. CPU Sampling.png (17 kB) - Stephan Lagraulet
      2. SOLR-6406.patch (16 kB) - Yonik Seeley
      3. SOLR-6406.patch (2 kB) - Mark Miller
      4. SOLR-6406.patch (13 kB) - Yonik Seeley


          Activity

          Mark Miller added a comment -
             1) Thread[id=55, name=qtp823025155-55, state=WAITING, group=TGRP-ChaosMonkeyNothingIsSafeTest]
                  at java.lang.Object.wait(Native Method)
                  at java.lang.Object.wait(Object.java:503)
                  at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer.blockUntilFinished(ConcurrentUpdateSolrServer.java:374)
                  at org.apache.solr.update.StreamingSolrServers.blockUntilFinished(StreamingSolrServers.java:103)
                  at org.apache.solr.update.SolrCmdDistributor.blockAndDoRetries(SolrCmdDistributor.java:228)
                  at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:89)
                  at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:766)
                  at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1662)
                  at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:179)
                  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
                  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
                  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
                  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
          
          Mark Miller added a comment -

          This was on a nightly of ChaosMonkeyNothingIsSafeTest. It's fairly rare.

          Mark Miller added a comment -

          I still see this happen in tests. It hangs at runners.wait(); no notify or anything ever comes, and it's just an ugly hang.

          Timothy Potter added a comment -

          Would it make sense to change the code to wait(timeoutMs), so we can recheck the state of things and go back to waiting if that still makes sense, instead of the indefinite wait you're seeing?
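
          As a generic illustration (plain Java, unrelated to the Solr code), a timed wait recovers even when a notification is missed, which is what makes it more forgiving than an indefinite one:

             // Tiny standalone demo: an un-timed Object.wait() hangs forever when the
             // notify happens before the wait starts, while lock.wait(1000) recovers.
             public class LostWakeupDemo {
               public static void main(String[] args) throws InterruptedException {
                 final Object lock = new Object();
                 synchronized (lock) {
                   lock.notifyAll();   // nobody is waiting yet, so this notification is lost
                 }
                 synchronized (lock) {
                   lock.wait(1000);    // wakes up after 1s regardless; a plain lock.wait()
                                       // here would block forever
                 }
                 System.out.println("done");
               }
             }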

          Mark Miller added a comment -

          Maybe. I've been trying to spot how it can happen (without a runner also still going, which I don't see). So far, I can't spot how it happens.

          Stephan Lagraulet added a comment -

          I have attached a CPU sample of a SolrCloud server that has had very poor update performance for the past few hours.
          I guess it could be related to this problem.

          Mark Miller added a comment -

          A more recent set of stack traces:

             [junit4] ERROR   0.00s | HdfsChaosMonkeyNothingIsSafeTest (suite) <<<
             [junit4]    > Throwable #1: java.lang.AssertionError: ERROR: SolrIndexSearcher opens=39 closes=38
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([71608A03B4692CB]:0)
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.endTrackingSearchers(SolrTestCaseJ4.java:468)
             [junit4]    > 	at org.apache.solr.SolrTestCaseJ4.afterClass(SolrTestCaseJ4.java:234)
             [junit4]    > 	at java.lang.Thread.run(Thread.java:745)Throwable #2: com.carrotsearch.randomizedtesting.ThreadLeakError: 2 threads leaked from SUITE scope at org.apache.solr.cloud.hdfs.HdfsChaosMonkeyNothingIsSafeTest: 
             [junit4]    >    1) Thread[id=243, name=qtp487431535-243, state=WAITING, group=TGRP-HdfsChaosMonkeyNothingIsSafeTest]
             [junit4]    >         at java.lang.Object.wait(Native Method)
             [junit4]    >         at java.lang.Object.wait(Object.java:502)
             [junit4]    >         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.blockUntilFinished(ConcurrentUpdateSolrClient.java:404)
             [junit4]    >         at org.apache.solr.update.StreamingSolrClients.blockUntilFinished(StreamingSolrClients.java:103)
             [junit4]    >         at org.apache.solr.update.SolrCmdDistributor.blockAndDoRetries(SolrCmdDistributor.java:231)
             [junit4]    >         at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:89)
             [junit4]    >         at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:778)
             [junit4]    >         at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1622)
             [junit4]    >         at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:183)
             [junit4]    >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
             [junit4]    >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:151)
             [junit4]    >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2079)
             [junit4]    >         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:667)
             [junit4]    >         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
             [junit4]    >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
             [junit4]    >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
             [junit4]    >         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
             [junit4]    >         at org.apache.solr.client.solrj.embedded.JettySolrRunner$DebugFilter.doFilter(JettySolrRunner.java:109)
             [junit4]    >         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
             [junit4]    >         at org.eclipse.jetty.servlets.UserAgentFilter.doFilter(UserAgentFilter.java:83)
             [junit4]    >         at org.eclipse.jetty.servlets.GzipFilter.doFilter(GzipFilter.java:300)
             [junit4]    >         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
             [junit4]    >         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
             [junit4]    >         at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:221)
             [junit4]    >         at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
             [junit4]    >         at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
             [junit4]    >         at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
             [junit4]    >         at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
             [junit4]    >         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
             [junit4]    >         at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
             [junit4]    >         at org.eclipse.jetty.server.Server.handle(Server.java:499)
             [junit4]    >         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
             [junit4]    >         at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
             [junit4]    >         at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
             [junit4]    >         at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
             [junit4]    >         at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
             [junit4]    >         at java.lang.Thread.run(Thread.java:745)
             [junit4]    >    2) Thread[id=266, name=searcherExecutor-30-thread-1, state=WAITING, group=TGRP-HdfsChaosMonkeyNothingIsSafeTest]
             [junit4]    >         at sun.misc.Unsafe.park(Native Method)
             [junit4]    >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
             [junit4]    >         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
             [junit4]    >         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
             [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
             [junit4]    >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
             [junit4]    >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
             [junit4]    >         at java.lang.Thread.run(Thread.java:745)
             [junit4]    > 	at __randomizedtesting.SeedInfo.seed([71608A03B4692CB]:0)
          

          It would seem that runners.wait() can race with a final runners.notifyAll() or something.

          Yonik Seeley added a comment -

          OK, here's one theory after a quick look:

                } finally {
                  synchronized (runners) {
                    if (runners.size() == 1 && !queue.isEmpty()) {
                      // keep this runner alive
                      scheduler.execute(this);
                    } else {
                      runners.remove(this);
                      if (runners.isEmpty())
                        runners.notifyAll();
                    }
                  }
          

          What if the queue isn't empty, so we try to do "scheduler.execute", but the scheduler has been shut down? That will throw an exception and the else block containing notifyAll() will never be executed.
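
          For reference, this is standard JDK behavior (generic code, not Solr): execute() on an ExecutorService that has been shut down throws RejectedExecutionException, so in the snippet above the exception would propagate out of the finally block and the cleanup that removes the runner and calls notifyAll() would never happen.

             import java.util.concurrent.ExecutorService;
             import java.util.concurrent.Executors;
             import java.util.concurrent.RejectedExecutionException;

             // Demonstrates that execute() on a shut-down executor throws rather than
             // silently doing nothing.
             public class RejectedAfterShutdown {
               public static void main(String[] args) {
                 ExecutorService scheduler = Executors.newSingleThreadExecutor();
                 scheduler.shutdown();
                 try {
                   scheduler.execute(() -> {});   // rejected: the pool is already shut down
                 } catch (RejectedExecutionException e) {
                   System.out.println("execute() threw: " + e);
                 }
               }
             }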

          Yonik Seeley added a comment -

          OK, how about this patch?

          It's probably hard to read as an actual patch file, so basically what I did was:

          • pulled out the "while (!queue.isEmpty())" loop into a sendUpdateStream() method.
          • added a for (;;) loop around the call to sendUpdateStream() and the associated error handling.
          • stayed in the outer loop if we're the last runner and something else was added to the queue.
          • removed runnerLock and the code to resubmit ourselves to the executor.

          This shouldn't change any behavior except for the hypothetical buggy behavior, but it's completely untested at this point.
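
          A rough, untested reconstruction of that restructuring (guessed names and simplified error handling; not the attached patch):

             import java.util.List;
             import java.util.concurrent.BlockingQueue;

             // Sketch of a Runner whose run() loops instead of resubmitting itself, and
             // that always removes itself and notifies waiters in a finally block.
             abstract class RunnerSketch implements Runnable {
               protected BlockingQueue<Object> queue;   // pending updates
               protected List<Runnable> runners;        // active runners; also used as the lock

               // drains the queue and streams updates until it is empty
               // (the old inline "while (!queue.isEmpty())" loop)
               abstract void sendUpdateStream() throws Exception;

               @Override
               public void run() {
                 try {
                   for (;;) {                    // outer loop replaces resubmitting to the executor
                     try {
                       sendUpdateStream();
                     } catch (Exception e) {
                       // handle/log the error as before (omitted in this sketch)
                     }
                     synchronized (runners) {
                       // stay alive only if we are the last runner and more work arrived meanwhile
                       if (!(runners.size() == 1 && !queue.isEmpty())) {
                         break;
                       }
                     }
                   }
                 } finally {
                   synchronized (runners) {
                     runners.remove(this);       // always deregister ourselves...
                     if (runners.isEmpty()) {
                       runners.notifyAll();      // ...and wake anyone blocked in blockUntilFinished
                     }
                   }
                 }
               }
             }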

          Mark Miller added a comment -

          Yeah, it seems like we need to be sure we remove a Runner when it is rejected by the scheduler. Patch attached - just a quick stab.
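
          In code, the idea of removing a Runner when the scheduler rejects it might look roughly like this (hypothetical names; not the attached patch):

             import java.util.List;
             import java.util.Queue;
             import java.util.concurrent.ExecutorService;
             import java.util.concurrent.RejectedExecutionException;

             // If the executor rejects the resubmit, still remove the runner and notify waiters.
             class RunnerCleanupSketch {
               static void finishRunner(Runnable runner, List<Runnable> runners,
                                        Queue<Object> queue, ExecutorService scheduler) {
                 synchronized (runners) {
                   if (runners.size() == 1 && !queue.isEmpty()) {
                     try {
                       scheduler.execute(runner);   // keep this runner alive
                       return;                      // resubmitted; it stays registered
                     } catch (RejectedExecutionException e) {
                       // executor already shut down; fall through and clean up below
                     }
                   }
                   runners.remove(runner);
                   if (runners.isEmpty()) {
                     runners.notifyAll();
                   }
                 }
               }
             }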

          Mark Miller added a comment -

          Ha - hadn't refreshed my browser.

          I'll review this approach.

          Mark Miller added a comment -

          I'm testing Yonik's approach today and will see if it resolves this. My quick patch does not FWIW.

          Mark Miller added a comment -

          Got a hang at the same spot - 2 threads stuck on it this time rather than the usual 1:

             [junit4]    >    2) Thread[id=1071, name=qtp612828486-1071, state=WAITING, group=TGRP-HdfsChaosMonkeyNothingIsSafeTest]
             [junit4]    >         at java.lang.Object.wait(Native Method)
             [junit4]    >         at java.lang.Object.wait(Object.java:502)
             [junit4]    >         at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient.blockUntilFinished(ConcurrentUpdateSolrClient.java:418)
             [junit4]    >         at org.apache.solr.update.StreamingSolrClients.blockUntilFinished(StreamingSolrClients.java:106)
             [junit4]    >         at org.apache.solr.update.SolrCmdDistributor.blockAndDoRetries(SolrCmdDistributor.java:231)
             [junit4]    >         at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:89)
             [junit4]    >         at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:778)
             [junit4]    >         at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1622)
             [junit4]    >         at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:183)
             [junit4]    >         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
             [junit4]    >         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:151)
             [junit4]    >         at org.apache.solr.core.SolrCore.execute(SolrCore.java:2079)
             [junit4]    >         at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:667)
             [junit4]    >         at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
             [junit4]    >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:220)
             [junit4]    >         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
          
          Yonik Seeley added a comment -

          Got a hang at the same spot - 2 threads stuck on it this time rather than the usual 1:

          Hmmm, yeah, I only fixed one spot where runners are submitted. I'll try taking another crack at it...
          Although having 2 threads stuck at the same place in blockUntilFinished should be impossible... it's a synchronized method.

          Mark Miller added a comment -

          Just two threads stuck - not necessarily from the same client. Previously I had only ever seen 1 thread stuck. Just noting it, may not mean much.

          Yonik Seeley added a comment -

          OK, I was able to reproduce... Interestingly, this is pretty easy to hit (and I also saw 2 threads stuck at the same point... which as you say must be 2 different client objects). There must be something more here than a subtle/little race condition.

          Yonik Seeley added a comment -

          Here's another patch (currently untested, but I'm going to start looping the chaos tests... they seemed to be good at hitting this).

          I added this comment to explain why I think we were hanging:
          // NOTE: if the executor is shut down, runners may never become empty (a scheduled task may never be run,
          // which means it would never remove itself from the runners list). This is why we don't wait forever
          // and periodically check if the scheduler is shutting down.

          So instead of waiting forever now, we periodically exit the wait and check if the scheduler is still running.

          I changed scheduler.isTerminated() to scheduler.isShutdown()... the latter should be true when the executor is starting to try and shut down. I think the former is only true when it actually succeeds.

          I also refactored the "add another runner" logic out to addRunner().
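
          A simplified sketch of the behavior described above (bounded wait plus a shutdown check); this is illustrative, not the committed ConcurrentUpdateSolrClient code:

             import java.util.List;
             import java.util.concurrent.ExecutorService;

             // Bounded wait: wake up periodically and stop waiting if the executor is
             // shutting down, since a rejected/never-run runner will never remove itself.
             class BlockUntilFinishedSketch {
               private final List<Runnable> runners;
               private final ExecutorService scheduler;

               BlockUntilFinishedSketch(List<Runnable> runners, ExecutorService scheduler) {
                 this.runners = runners;
                 this.scheduler = scheduler;
               }

               public void blockUntilFinished() {
                 synchronized (runners) {
                   while (!runners.isEmpty()) {
                     try {
                       runners.wait(250);          // don't wait forever
                     } catch (InterruptedException e) {
                       Thread.currentThread().interrupt();
                       return;
                     }
                     if (scheduler.isShutdown()) { // true as soon as shutdown starts,
                       break;                      // unlike isTerminated()
                     }
                   }
                 }
               }
             }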

          Mark Miller added a comment -

          Ha, that's actually close to the first hack fix I made: occasionally waking up from the wait and checking whether it's empty again.

          Yonik Seeley added a comment -

          OK, so I haven't hit any hangs with this latest patch and shutdownNow() on the associated executor. Interestingly enough though, this still results in inconsistent shard failures. My guess is that the shutdown of the executor is done as one of the last steps in CoreContainer.shutdown(), which still gives time for streaming update requests to continue streaming.

          ASF subversion and git services added a comment -

          Commit 1712045 from Yonik Seeley in branch 'dev/trunk'
          [ https://svn.apache.org/r1712045 ]

          SOLR-6406: fix race/hang in ConcurrentUpdateSolrClient.blockUntilFinished when executor service is shut down

          ASF subversion and git services added a comment -

          Commit 1712047 from Yonik Seeley in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1712047 ]

          SOLR-6406: fix race/hang in ConcurrentUpdateSolrClient.blockUntilFinished when executor service is shut down

          Yonik Seeley added a comment -

          Reopening... something is wrong.

          Overview of what happened:

          • I tested my update to DistributedUpdateProcessor in SOLR-8203 alone, and verified that there were no shard inconsistency failures
          • Mark tested his change to use shutdownNow on the updateExecutor alone (w/o my change), and reported no shard inconsistency failures, but he did hit hangs, which led me to tackle this issue
          • I tested this issue w/o my fix in SOLR-8203, to more easily reproduce the hang, and to verify it had been fixed - I was not looking for shard inconsistency failures
          • Now that both patches are committed, I'm seeing shard inconsistency failures again!

          Either:

          • I messed up this patch somehow, causing updates to be further reordered
          • The idea of this patch is somehow incompatible with SOLR-8203 (unlikely)
          • Something else in trunk has changed (unlikely)

          First, I'm going to go back to trunk w/o both of these patches and start with just the check in the DistributedUpdateProcessor, and move on from there until I find out what reintroduced the problem.

          Yonik Seeley added a comment -

          Update: I looped 3 tests overnight...
          1) trunk with shutdownNow on the update executor reverted
          2) trunk with shutdownNow and the client changes in this patch reverted
          3) plain trunk

          Only #3 resulted in inconsistent shards. I'm setting up some new variants to test now...

          Yonik Seeley added a comment -

          The other variants:
          4) trunk with only client changes reverted (i.e. DUH check enabled, shutdownNow used)
          5) trunk with client + DUH changes reverted (i.e. only shutdownNow enabled)
          6) trunk with alternate client changes (only changes to blockUntilFinished)
          7) trunk with alternate client changes 2 (only changes to blockUntilFinished, but using former isTerminated instead of isShutdown)

          I managed to get inconsistent shard runs with all of these.
          The common element is shutdownNow being used on the shard update executor.

          Mark Miller added a comment -

          Strange. I got over 300 runs without an out-of-sync failure with it originally. I have not tried on recent trunk or with the recent changes though.

          Yonik Seeley added a comment - edited

          I was analyzing another "shards-out-of-sync" failure on trunk.
          It looks like certain updates are just not being forwarded from the leader to a certain replica.

          Working theory: the max connections per host of the HttpClient is being hit, starving updates from certain update threads.
          This could account for why shutdownNow on the update executor service is having such an impact. In an orderly shutdown, all scheduled jobs will still be run (I think), which means that connections will be released, and the updates that were being starved will get to proceed. But it's for exactly this reason that we should probably keep the shutdownNow... it mimics much better what will happen in real world situations when a node goes down.

          From this, it looks like max connections per host is 20:

          13404 INFO  (TEST-HdfsChaosMonkeyNothingIsSafeTest.test-seed#[A22375CC545D2B82]) [    ] o.a.s.h.c.HttpShardHandlerFactory created with socketTimeout : 90000,urlScheme : ,connTimeout : 15000,maxConnectionsPerHost : 20,maxConnections : 10000,corePoolSize : 0,maximumPoolSize : 2147483647,maxThreadIdleTime : 5,sizeOfQueue : -1,fairnessPolicy : false,useRetries : false,
          

          edit: oops, the above is for search, not updates. The default for updates looks like it's 100, so it's harder to hit. Although if we have a mix of streaming and non-streaming requests, and connections are not reused immediately, it's perhaps still possible. Still digging along this line of logic.

          The test used 12 nodes (and 2 shards)... increasing the chance of hitting the max connections (since all nodes run on the same host).

          Yonik Seeley added a comment -

          So increasing maxConnectionsPerHost didn't fix the problem.
          I instrumented ConcurrentUpdateSolrServer to try to understand what is happening when, and am analyzing some of those failures now.
          They are all beyond the max size that can be uploaded to JIRA though, so I'll just put up a summary based on what I find.

          Stephan Lagraulet added a comment -

          Hi Yonik Seeley, did you make any progress on this issue?

          Yonik Seeley added a comment -

          Actually, since this issue is more about the hang in blockUntilFinished (and that's been fixed), this can be closed.


            People

            • Assignee: Yonik Seeley
            • Reporter: Mark Miller
            • Votes: 0
            • Watchers: 12
