Solr / SOLR-7836

Possible deadlock when closing refcounted index writers.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels: None

      Description

      Preliminary patch for what looks like a possible race condition between writerFree and pauseWriter in DefaultSolrCoreState.

      Looking for comments and/or why I'm completely missing the boat.

      1. deadlock_3.res.zip
        46 kB
        Erick Erickson
      2. deadlock_5_pass_iw.res.zip
        113 kB
        Erick Erickson
      3. deadlock_test
        0.8 kB
        Erick Erickson
      4. SOLR-7836.patch
        16 kB
        Yonik Seeley
      5. SOLR-7836.patch
        15 kB
        Yonik Seeley
      6. SOLR-7836.patch
        24 kB
        Erick Erickson
      7. SOLR-7836.patch
        24 kB
        Erick Erickson
      8. SOLR-7836.patch
        5 kB
        Erick Erickson
      9. SOLR-7836-reorg.patch
        6 kB
        Erick Erickson
      10. SOLR-7836-synch.patch
        1 kB
        Erick Erickson

        Activity

        Erick Erickson added a comment -

        Looking for comments:

        This looks like I changed more than I did: I expanded the scope of the try/finally block in newIndexWriter, moved the common initialization code out of an else clause, and put a try/finally block in closeIndexWriter. The rest of the diff is just noise due to indentation.
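
        A rough sketch of the shape of that change (illustrative only; the field names come from this discussion, not from the actual patch):

            // Hypothetical sketch of the newIndexWriter() change described above, not the real patch.
            // The point: pauseWriter is always cleared and waiters are woken, even if
            // closing the old writer or opening the new one throws.
            synchronized (writerPauseLock) {
              try {
                pauseWriter = true;
                // ... wait for writerFree, close/rollback the old writer, open the new one ...
              } finally {
                pauseWriter = false;            // never leave a dangling pauseWriter behind
                writerPauseLock.notifyAll();    // wake threads blocked in getIndexWriter()
              }
            }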

        I have a field report of a deadlock; here are the relevant stack traces:

        I'm guessing the first two threads, both in DefaultSolrCoreState, are where the root of the problem lies.

        ********************First thread in DefaultSolrCoreState

        • The thread that owns the updateLock is stuck waiting for pauseWriter to go to false so it can get the IndexWriter:
          java.lang.Object@c0d4b9
          java.lang.Object.wait​(Native Method)
          org.apache.solr.update.DefaultSolrCoreState.getIndexWriter​(DefaultSolrCoreState.java:94)
          org.apache.solr.update.DirectUpdateHandler2.addAndDelete​(DirectUpdateHandler2.java:436)
          org.apache.solr.update.DirectUpdateHandler2.addDoc0​(DirectUpdateHandler2.java:216)
          org.apache.solr.update.DirectUpdateHandler2.addDoc​(DirectUpdateHandler2.java:160)
          org.apache.solr.update.processor.RunUpdateProcessor.processAdd​(RunUpdateProcessorFactory.java:69)
          org.apache.solr.update.processor.UpdateRequestProcessor.processAdd​(UpdateRequestProcessor.java:51)
          org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd​(DistributedUpdateProcessor.java:928)
          org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd​(DistributedUpdateProcessor.java:1082)
          org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd​(DistributedUpdateProcessor.java:695)
          org.apache.solr.handler.loader.JavabinLoader$1.update​(JavabinLoader.java:96)
          org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator​(JavaBinUpdateRequestCodec.java:166)
          org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator​(JavaBinUpdateRequestCodec.java:136)
          org.apache.solr.common.util.JavaBinCodec.readVal​(JavaBinCodec.java:225)
          org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList​(JavaBinUpdateRequestCodec.java:121)
          org.apache.solr.common.util.JavaBinCodec.readVal​(JavaBinCodec.java:190)
          org.apache.solr.common.util.JavaBinCodec.unmarshal​(JavaBinCodec.java:116)
          org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal​(JavaBinUpdateRequestCodec.java:173)
          org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs​(JavabinLoader.java:106)
          org.apache.solr.handler.loader.JavabinLoader.load​(JavabinLoader.java:58)
          org.apache.solr.handler.UpdateRequestHandler$1.load​(UpdateRequestHandler.java:92)
          org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody​(ContentStreamHandlerBase.java:74)
          org.apache.solr.handler.RequestHandlerBase.handleRequest​(RequestHandlerBase.java:135)
          org.apache.solr.core.SolrCore.execute​(SolrCore.java:1956)
          org.apache.solr.servlet.SolrDispatchFilter.execute​(SolrDispatchFilter.java:799)
          org.apache.solr.servlet.SolrDispatchFilter.doFilter​(SolrDispatchFilter.java:422)
          org.apache.solr.servlet.SolrDispatchFilter.doFilter​(SolrDispatchFilter.java:208)
          org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter​(ServletHandler.java:1419)
          com.apple.cie.search.plugin.auth.TrustFilter.doFilter​(TrustFilter.java:43)
          org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter​(ServletHandler.java:1419)
          org.eclipse.jetty.servlet.ServletHandler.doHandle​(ServletHandler.java:455)
          org.eclipse.jetty.server.handler.ScopedHandler.handle​(ScopedHandler.java:137)
          org.eclipse.jetty.security.SecurityHandler.handle​(SecurityHandler.java:557)
          org.eclipse.jetty.server.session.SessionHandler.doHandle​(SessionHandler.java:231)
          org.eclipse.jetty.server.handler.ContextHandler.doHandle​(ContextHandler.java:1075)
          org.eclipse.jetty.servlet.ServletHandler.doScope​(ServletHandler.java:384)
          org.eclipse.jetty.server.session.SessionHandler.doScope​(SessionHandler.java:193)
          org.eclipse.jetty.server.handler.ContextHandler.doScope​(ContextHandler.java:1009)
          org.eclipse.jetty.server.handler.ScopedHandler.handle​(ScopedHandler.java:135)
          org.eclipse.jetty.server.handler.ContextHandlerCollection.handle​(ContextHandlerCollection.java:255)
          org.eclipse.jetty.server.handler.HandlerCollection.handle​(HandlerCollection.java:154)
          org.eclipse.jetty.server.handler.HandlerWrapper.handle​(HandlerWrapper.java:116)
          org.eclipse.jetty.server.Server.handle​(Server.java:368)
          org.eclipse.jetty.server.AbstractHttpConnection.handleRequest​(AbstractHttpConnection.java:489)
          org.eclipse.jetty.server.BlockingHttpConnection.handleRequest​(BlockingHttpConnection.java:53)
          org.eclipse.jetty.server.AbstractHttpConnection.content​(AbstractHttpConnection.java:953)
          org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content​(AbstractHttpConnection.java:1014)
          org.eclipse.jetty.http.HttpParser.parseNext​(HttpParser.java:953)
          org.eclipse.jetty.http.HttpParser.parseAvailable​(HttpParser.java:240)
          org.eclipse.jetty.server.BlockingHttpConnection.handle​(BlockingHttpConnection.java:72)
          org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run​(SocketConnector.java:264)
          org.eclipse.jetty.util.thread.QueuedThreadPool.runJob​(QueuedThreadPool.java:608)
          org.eclipse.jetty.util.thread.QueuedThreadPool$3.run​(QueuedThreadPool.java:543)
          java.lang.Thread.run​(Thread.java:745)

        ********************Second thread in DefaultSolrCoreState

        • The thread that sets pauseWriter to true that is supposed to later set it to false is stuck on waiting for writerFree to go to true:
          java.lang.Object@c0d4b9
          java.lang.Object.wait​(Native Method)
          org.apache.solr.update.DefaultSolrCoreState.newIndexWriter​(DefaultSolrCoreState.java:156)
          org.apache.solr.core.SolrCore.reload​(SolrCore.java:431)
          org.apache.solr.core.CoreContainer.reload​(CoreContainer.java:586)
          org.apache.solr.handler.admin.CoreAdminHandler.handleReloadAction​(CoreAdminHandler.java:701)
          org.apache.solr.handler.admin.CoreAdminHandler.handleRequestInternal​(CoreAdminHandler.java:225)
          org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody​(CoreAdminHandler.java:188)
          org.apache.solr.handler.RequestHandlerBase.handleRequest​(RequestHandlerBase.java:135)
          org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest​(SolrDispatchFilter.java:751)
          org.apache.solr.servlet.SolrDispatchFilter.doFilter​(SolrDispatchFilter.java:259)
          org.apache.solr.servlet.SolrDispatchFilter.doFilter​(SolrDispatchFilter.java:208)
          org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter​(ServletHandler.java:1419)
          com.apple.cie.search.plugin.auth.TrustFilter.doFilter​(TrustFilter.java:43)
          org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter​(ServletHandler.java:1419)
          org.eclipse.jetty.servlet.ServletHandler.doHandle​(ServletHandler.java:455)
          org.eclipse.jetty.server.handler.ScopedHandler.handle​(ScopedHandler.java:137)
          org.eclipse.jetty.security.SecurityHandler.handle​(SecurityHandler.java:557)
          org.eclipse.jetty.server.session.SessionHandler.doHandle​(SessionHandler.java:231)
          org.eclipse.jetty.server.handler.ContextHandler.doHandle​(ContextHandler.java:1075)
          org.eclipse.jetty.servlet.ServletHandler.doScope​(ServletHandler.java:384)
          org.eclipse.jetty.server.session.SessionHandler.doScope​(SessionHandler.java:193)
          org.eclipse.jetty.server.handler.ContextHandler.doScope​(ContextHandler.java:1009)
          org.eclipse.jetty.server.handler.ScopedHandler.handle​(ScopedHandler.java:135)
          org.eclipse.jetty.server.handler.ContextHandlerCollection.handle​(ContextHandlerCollection.java:255)
          org.eclipse.jetty.server.handler.HandlerCollection.handle​(HandlerCollection.java:154)
          org.eclipse.jetty.server.handler.HandlerWrapper.handle​(HandlerWrapper.java:116)
          org.eclipse.jetty.server.Server.handle​(Server.java:368)
          org.eclipse.jetty.server.AbstractHttpConnection.handleRequest​(AbstractHttpConnection.java:489)
          org.eclipse.jetty.server.BlockingHttpConnection.handleRequest​(BlockingHttpConnection.java:53)
          org.eclipse.jetty.server.AbstractHttpConnection.content​(AbstractHttpConnection.java:953)
          org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content​(AbstractHttpConnection.java:1014)
          org.eclipse.jetty.http.HttpParser.parseNext​(HttpParser.java:861)
          org.eclipse.jetty.http.HttpParser.parseAvailable​(HttpParser.java:240)
          org.eclipse.jetty.server.BlockingHttpConnection.handle​(BlockingHttpConnection.java:72)
          org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run​(SocketConnector.java:264)
          org.eclipse.jetty.util.thread.QueuedThreadPool.runJob​(QueuedThreadPool.java:608)
          org.eclipse.jetty.util.thread.QueuedThreadPool$3.run​(QueuedThreadPool.java:543)
          java.lang.Thread.run​(Thread.java:745)

        ****************Lots of threads stuck in DirectUpdateHandler2, but I suspect these aren't where the problem really is:
        java.lang.Object@73cdbe11
        org.apache.solr.update.DirectUpdateHandler2.addAndDelete​(DirectUpdateHandler2.java:435)
        org.apache.solr.update.DirectUpdateHandler2.addDoc0​(DirectUpdateHandler2.java:216)
        org.apache.solr.update.DirectUpdateHandler2.addDoc​(DirectUpdateHandler2.java:160)
        org.apache.solr.update.processor.RunUpdateProcessor.processAdd​(RunUpdateProcessorFactory.java:69)
        org.apache.solr.update.processor.UpdateRequestProcessor.processAdd​(UpdateRequestProcessor.java:51)
        org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd​(DistributedUpdateProcessor.java:928)
        org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd​(DistributedUpdateProcessor.java:1082)
        org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd​(DistributedUpdateProcessor.java:695)

        Erick Erickson added a comment -

        OK, the patch doesn't address the real issue. I still think it's bad to leave those dangling pauseWriters around, but the real issue appears to be over in DirectUpdateHandler2. A ref-counted index writer is obtained in addDoc0, but addDoc0 then calls addAndDelete, which tries to get a ref-counted index writer again. If another thread sets pauseWriter in between, it deadlocks.

        I think the solution is to just pass the IndexWriter down to addAndDelete, but I won't have time to really look until this evening.
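
        For illustration, the problematic shape is roughly the following (a sketch using names from the stack traces above; signatures and bodies are elided, not the actual code):

            void addDoc0(AddUpdateCommand cmd) throws IOException {
              RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core); // first acquisition
              try {
                // ...
                addAndDelete(cmd);   // acquires the writer AGAIN, see below
              } finally {
                iw.decref();
              }
            }

            void addAndDelete(AddUpdateCommand cmd) throws IOException {
              // If a reload thread has set pauseWriter between the two acquisitions,
              // this call blocks until pauseWriter clears. The reload thread, in turn,
              // is waiting for writerFree, which can't happen until the outer reference
              // taken in addDoc0 is released -- so the two threads deadlock.
              RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core);
              try {
                // ...
              } finally {
                iw.decref();
              }
            }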

        Mark Miller added a comment -

        That sounds right - it's been a while since I've looked at the code, but the idea is: you get a writer, use it briefly, then release it in a finally. There should not be code that gets a writer, then gets a writer again, and tries to release both of them afterwards.

        Erick Erickson added a comment -

        Well, I've kicked the can down the road, but now there's a different deadlock. I wrote a test to beat the heck out of the update/delete-by-query/reload stuff and after the changes above....

        In the morning when I'm fresher I'll look at why there are so many different things that need to have locks on them. But we have updateLock, and loops like below that seem to be locking each other out.

        Yonik Seeley, Mark Miller (and anyone else): loops like this seem fraught, or is it just me and it's late?
            while (!writerFree) {
              try {
                writerPauseLock.wait(100);
              } catch (InterruptedException e) {}
              if (closed) {
                throw new SolrException(ErrorCode.SERVICE_UNAVAILABLE, "SolrCoreState already closed");
              }
            }

        They just spin, waiting until some other thread sets writerFree to true.
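
        For comparison, a notification-style alternative (what the comment above seems to be alluding to) would look roughly like this. This is only a sketch - Mark explains below why the timed wait and the closed check were added:

            // Waiting side: block until the flag changes instead of polling on a timer.
            synchronized (writerPauseLock) {
              while (!writerFree) {
                if (closed) {
                  throw new SolrException(ErrorCode.SERVICE_UNAVAILABLE, "SolrCoreState already closed");
                }
                try {
                  writerPauseLock.wait();   // woken explicitly by notifyAll() below
                } catch (InterruptedException e) {}
              }
            }

            // Releasing side: flip the flag and wake every waiter in the loop above.
            synchronized (writerPauseLock) {
              writerFree = true;
              writerPauseLock.notifyAll();
            }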

        Mark Miller added a comment -

        What's wrong with the loop?

        The code in place is pretty tricky because it had to be retrofitted onto a model it didn't fit, in order to make it fit the right model.

        I'm not exactly sure what you are pointing out in the loop, though.

        Do you have stack traces for the latest deadlock? Can you share the new test?

        Erick Erickson added a comment -

        Well, something that sits in a while loop waiting for another thread to change a variable before doing anything seems harder to get right than some kind of notification - as I'm finding. But I'm terrified of changing the whole approach.

        I need to dig some more before posting stack traces, plus it's the trace with this patch applied (and maybe changed)....

        Mostly bringing it up to see if someone looks at the loops and says "Yuck, how did we do THAT?" before thinking about replacing the mechanism. You've shot that hope in the head, though. And I'm assuming that part of the complexity here is that we want the index writer to be held by a thread as briefly as possible so as not to introduce bottlenecks.

        Anyway, I'll dig some more after I take care of morning tasks.

        Thanks!

        Mark Miller added a comment - edited

        Previously it was done via notification - the timeout and the check of isClosed were added at some point because this could hang on shutdown (or at least it really slowed down shutdown - I can't remember which).

        Erick Erickson added a comment - edited

        This might do it. I'm running 1,000 iterations (or until morning, whichever comes first), but it's gone through 150 or so already, which was usually more than enough to trigger the new deadlock I found, so I'm hopeful. Actually, I think this is really the old deadlock; the first fix wasn't very good.

        There are two things I'd appreciate any opinions on:
        1> this patch moves the UpdateLog.add() call out of the ref-counted IndexWriter block in DirectUpdateHandler2.addDoc0() in multiple places (refactored into methods). But UpdateLog.add gets a new IndexWriter itself, which is where the deadlock appears to be: it's fighting with CoreContainer.reload(), which also calls DefaultSolrCoreState.newIndexWriter.

        2> There are two nocommits, but they are there to make it easy to see the other bit I'd most welcome other eyes on: I moved
        if (ulog != null) ulog.add(cmd, true);
        out of a synchronized block. This fits the pattern elsewhere in that code, so I'm not too worried, but I wanted to draw attention to it in case anyone wants to look. Actually, a lot of the operations on ulog are synchronized on solrCoreState.getUpdateLock(), and a bunch aren't. What's up with that? The change marked by //nocommit matches the (non-synchronized) usages in addDoc0(), though.

        Anyway, the problem was in DirectUpdateHandler2.addDoc0(): the ulog.add call was inside the ref-counted IndexWriter block when adding documents (the addAndDelete case). ulog.add eventually tries to get a new Searcher, which can deadlock with another thread called from SolrCore.reload, since the reload wants a new IndexWriter.

        I haven't run precommit or the full test suite yet, I want to make sure the iterations I'm doing tonight work.

        This looks more intrusive than it is. I refactored some methods out of DirectUpdateHandler2.addDoc0() in order to make this pattern more visible. The original code concealed the fact that the ref counted index writer surrounded all the code, including the ulog.add.

        This is the pattern for all the refactored methods:

            RefCounted<IndexWriter> iw = solrCoreState.getIndexWriter(core);
            try {
              IndexWriter writer = iw.get();
              // do the right thing
            } finally {
              iw.decref();
            }

            if (ulog != null) ulog.add(cmd, true);

        I think this is a better approach than the original fix, which passed the index writer down to the addAndDelete method. We'd have had to pass the IndexWriter on to the ulog.add call or something, which would have been a pain (even if it was possible).

        The new test case is in this patch. Note that it takes a race condition to get this to fail, so it needs to be run a lot. And using tests.iters apparently triggers the timeouts so....

        Mark Miller added a comment -

        Can we get the stress test in as a nightly?

        Erick Erickson added a comment -

        Looking a bit more, all the ulog.(some operation) commands have this at the top:

        synchronized (this) {
        

        so they're serialized. I don't like the fact that lots of these are called from within

        synchronized (solrCoreState.getUpdateLock()) {
        

        as that seems like a good place for a deadlock if the ulog operations ever want to take that same lock. But I'm not about to change that now by, say, moving all the ulog operations out of the synchronized blocks, since I don't know why they were put there in the first place; we can make that a separate JIRA if necessary/desirable. I'm comfortable enough with the one I moved out to go ahead with this commit.
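
        The nesting being described looks roughly like this (an illustrative sketch, not the exact Solr code):

            // DirectUpdateHandler2 takes the core-wide update lock...
            synchronized (solrCoreState.getUpdateLock()) {
              // ...and then calls into the update log, whose methods start with
              // synchronized (this) on the UpdateLog instance:
              ulog.deleteByQuery(cmd);   // internally: synchronized (this) { ... }
            }
            // Lock order here is updateLock -> ulog monitor. If any UpdateLog method ever
            // took solrCoreState.getUpdateLock() itself (the reverse order), the two
            // locks could deadlock -- which is the concern above.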

        So I'll clean up the patch and commit unless there are objections today/tomorrow.

        Erick Erickson added a comment -

        Final patch I think. It:
        1> tightens up acquiring the index writer a bit in the extracted methods. This is really a minor nit.

        2> makes the test nightly.

        I'll commit this tonight/tomorrow; I need to run it through precommit and the rest of the tests.

        ASF subversion and git services added a comment -

        Commit 1694854 from Erick Erickson in branch 'dev/trunk'
        [ https://svn.apache.org/r1694854 ]

        SOLR-7836: Possible deadlock when closing refcounted index writers

        ASF subversion and git services added a comment -

        Commit 1694855 from Erick Erickson in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1694855 ]

        SOLR-7836: Possible deadlock when closing refcounted index writers

        Erick Erickson added a comment -

        Thanks Jessica!

        Erick Erickson added a comment -

        I modified the three refactored methods in DirectUpdateHandler2 to be:

            //old code
            if (ulog != null) ulog.add(cmd);

            //new code
            synchronized (solrCoreState.getUpdateLock()) {
              if (ulog != null) ulog.add(cmd);
            }

        and 200 iterations later no failures from TestStressReorder (I'll try the new test code momentarily, but it'll take longer).

        Yonik Seeley, Mark Miller:
        Here's what I don't like about this. None of these three operations (the three new methods: addAndDelete, doNormalUpdate and allowDuplicateUpdate) couples actually changing the index with the write to the ulog. Only the code in addAndDelete did before, and we may have gotten away with that because the indexWriter was grabbed at the very top of addDoc0() and effectively locked other operations out. Maybe.

        But if the ulog.add goes inside the synchronization on the update lock in addAndDelete, there's certainly a deadlock, so putting it back the way it was isn't really an option.

        I'm starting to wonder if this isn't a bit backwards. Rather than going at this piecemeal, what about synchronizing on updateLock at the top of addDoc0 and in CoreContainer.reload(), which drives one of the deadlock failure cases? Not sure I really like the idea, but I'll give it a (local) test....

        Meanwhile, I'll check the current fix in since it's certainly better pending more experimentation.

        ASF subversion and git services added a comment -

        Commit 1694913 from Erick Erickson in branch 'dev/trunk'
        [ https://svn.apache.org/r1694913 ]

        SOLR-7836: Possible deadlock when closing refcounted index writers. Surrounded ulog updates with updatelock

        ASF subversion and git services added a comment -

        Commit 1694914 from Erick Erickson in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1694914 ]

        SOLR-7836: Possible deadlock when closing refcounted index writers. Surrounded ulog updates with updatelock

        Erick Erickson added a comment -

        bq: I'm starting to wonder if this isn't a bit backwards. Rather than going at this piecemeal, what about synchronizing on updateLock at the top of addDoc0 and in CoreContainer.reload(), which drives one of the deadlock failure cases? Not sure I really like the idea, but I'll give it a (local) test....

        That's a really bad idea - it fails the first time, every time.

        Mark Miller added a comment -

        What about the nightly test?

            //new code
            synchronized (solrCoreState.getUpdateLock()) {
              if (ulog != null) ulog.add(cmd);
            }

        While that says getUpdateLock, it appears to have also been intended as a deleteLock, judging by its callers and variable name. It looks like the old index update lock was tied into also being a delete lock.

        What is the logic, other than this test, that says it now also needs to lock adds? Why would the index update lock, which did not block adds and was also used as a delete lock, lock adds now?

        Yonik Seeley should really look at this, as it looks like his area of the code.

        I'm skeptical that this is the right fix.

        Erick Erickson added a comment -

        bq: What about the nightly test?

        I'm beating that up as well, 50+ runs and iterating without a problem so far.

        bq: While that says getUpdateLock, it appears to have also been intended as a deleteLock, judging by its callers and variable name. It looks like the old index update lock was tied into also being a delete lock

        Hmmm, which may explain why changing it in addAndDelete was a problem and apparently not in the other cases.

        bq: What is the logic, other than this test, that says it now also needs to lock adds?

        It may well be flawed logic, but I find it really suspicious that the ulog.add() call is sometimes protected and sometimes not. Although your point that it really appears to be about deletes (which the others don't do) may be why. Moving the indexwriter acquisition away from the beginning of addDoc0() did remove a layer of protection around the ulog.add calls in all the code paths in addDoc0(); the ref-counted index writer possibly (unintentionally?) locked out some other updates. ulog.add can get a new ref-counted index writer, which led to the deadlock in the addAndDelete case.

        Mark Miller added a comment -

        I find it really suspicious that the ulog.add() call is sometimes protected and sometimes not.

        That's pretty unsound logic for making a change in such a complicated area of code...

        Mark Miller added a comment -

        I'm pretty sure that deadlocks around accessing an index writer should not involve synchronization work with the tlog. It may have inadvertently helped, but the two things are pretty unrelated.

        Mark Miller added a comment -

        Once you get the stress test up, I'm happy to do some parallel investigation into the deadlock.

        Mark Miller added a comment -

        Once you get the stress test up

        Nevermind, I see it's in an earlier test. I'll take a look.

        Erick Erickson added a comment -

        bq: I'm pretty sure that deadlocks around accessing an index writer should not involve synchronization work with the tlog. It may have inadvertently helped, but the two things are pretty unrelated.

        I don't disagree, but the update log and index writer are intertwined; that's the problem. I'm perfectly willing to agree that they should be separated out completely, but I haven't had any confirmation that they can be, or were ever intended to be.

        ulog.add() calls openNewSearcher, which gets an indexWriter, and that's where things go south. It calls getIndexWriter with null, which has the note "core == null is a signal to just return the current writer, or null". That doesn't really increment the reference count, but it does go through the interlock with pauseWriter and the like. Then openNewSearcher does a decref on the writer, which was never incremented in the first place; that only works because the decref for the index writer doesn't decrement if the count is 0.
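
        Roughly, the path described above looks like this (a sketch with assumed names, not the actual source):

            // Reached from ulog.add() when it needs a realtime searcher:
            // core == null just returns the current writer without bumping the refcount.
            RefCounted<IndexWriter> writer = solrCoreState.getIndexWriter(null);
            try {
              // ... open a new reader/searcher against writer.get() ...
            } finally {
              // Never incremented above, so this relies on decref() being a
              // no-op when the count is already 0.
              writer.decref();
            }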

        I've no objection to taking the two additional synchronized blocks out of DirectUpdateHandler2. The one in addAndDelete was already there, although it was enclosed by getting an index writer (which is where all the problems happened). I'm not averse to taking that one out too.

        BTW, you can't use tests.iters for the new test. I didn't want to wait for the default suite timeout, so I set it locally to 10 minutes, and that timer apparently runs across all iters. I wrote a shell script to re-invoke the test for a long time (500 times last night).

        Mark Miller added a comment -

        Still trying to duplicate a hang with the above patch removed, but one strange thing I had to work around because the test kept failing relatively quickly for me (JVM bug?):

        Index: solr/core/src/java/org/apache/solr/core/SolrCore.java
        ===================================================================
        --- solr/core/src/java/org/apache/solr/core/SolrCore.java	(revision 1695180)
        +++ solr/core/src/java/org/apache/solr/core/SolrCore.java	(working copy)
        @@ -1639,7 +1639,9 @@
                   tmp = new SolrIndexSearcher(this, newIndexDir, getLatestSchema(),
                       (realtime ? "realtime":"main"), newReader, true, !realtime, true, directoryFactory);
                 } else  {
        -          RefCounted<IndexWriter> writer = getUpdateHandler().getSolrCoreState().getIndexWriter(this);
        +          // when this was getUpdateHandler#getSolrCoreState it could hit an NPE somehow,
        +          // even though variables are all final
        +          RefCounted<IndexWriter> writer = solrCoreState.getIndexWriter(this);
                   DirectoryReader newReader = null;
                   try {
                     newReader = indexReaderFactory.newReader(writer.get(), this);
        
        Yonik Seeley added a comment -

        a lot of the operations on ulog are synchronized on solrCoreState.getUpdateLock(), and a bunch aren't. What's up with that?

        Some background here: I wrote the original tlog code (and the DUH2 code that called it). There was no solrCoreState.getUpdateLock() (and no sharing of writers across reloads, even). Mark implemented that part and, I believe, changed synchronized(this) to synchronized(solrCoreState.getUpdateLock()) to account for the fact that we could have 2 DUH2 instances.

        Hopefully there are comments about when something is synchronized (and why it needed to be). The intent was to have the common case unsynchronized for best throughput. For example, I don't believe writer.updateDocument for the common case is synchronized. That would be bad for indexing performance.

        deleteByQuery (or an add where we detect a reordered DBQ that we need to apply again) contains the following comment next to the synchronized statement:

              //
              // synchronized to prevent deleteByQuery from running during the "open new searcher"
              // part of a commit.  DBQ needs to signal that a fresh reader will be needed for
              // a realtime view of the index.  When a new searcher is opened after a DBQ, that
              // flag can be cleared.  If those thing happen concurrently, it's not thread safe.
              //
        

        I'm re-reviewing all this code now to get it back in my head...

        Erick Erickson added a comment -

        Thanks guys, getting it all in my head is...interesting.

        Yonik Seeley Yeah, I saw that comment. In that case, removing the two synchronizations in the refactored methods other than addAndDelete is probably indicated. The one in addAndDelete was there originally, just within the IndexWriter try/finally, which is where the issue was, since it'd go out and get a new searcher eventually.

        Mark Miller I had to write a shell script to re-submit that individual test repeatedly; it'd pretty much always fail for me within 50 runs. I'll back those changes out, run it on my machine where it fails reliably, and post the results when I get a deadlock.

        Mark Miller added a comment -

        Yeah, I have a beasting script that does the same, though it can also launch runs in parallel (https://gist.github.com/markrmiller/dbdb792216dc98b018ad). Still no deadlock on my machine yet. I'll keep trying for a while though.

        Erick Erickson added a comment -

        Here's the stack trace. A few things:

        > I got this source by "svn checkout -r1694809 https://svn.apache.org/repos/asf/lucene/dev/trunk" and added the test to that code base.

        > It appears to need two things to fail: a reload operation and a delete by query.

        > threads "WRITER5" and "TEST-TestReloadDeadlock.testReloadDeadlock-seed#[4CFFCB253DB33784]" are the ones I think are fighting here.

        > The original fix was to pass the index writer from addDoc0() to addAndDelete, but this doesn't work either. I'll see if I can attach a run with that change for comparison.

        > I've also attached the script I use to run this, although I don't run it in parallel.

        Erick Erickson added a comment -

        Here's a fail with passing the index writer from addDoc0() to addAndDelete, my first attempt at a "fix".

        I commented out the necessary lines rather than add or delete them, so the line numbers should correspond to the checkout I mentioned above.

        Mark Miller added a comment -

        Strange, I have not been able to get the deadlock with that test on my machine.

        Do you have the full set of stack traces for a deadlock state available?

        Erick Erickson added a comment -

        Mark Miller I thought everything was in the two zip files I attached a couple of days ago - what did I miss? See my comments two days ago for the provenance.

        BTW, I'll be intermittently available today and not at all tomorrow (business hours CA time) if you'd like to chat.

        Mark Miller added a comment -

        Ah, sorry, I was looking for the full stack traces prior to two days ago - didn't realize they were in the zip. I've been having trouble extracting that into something viable on my Mac. I'll try on my Ubuntu machine.

        Erick Erickson added a comment -

        Let me know if you need them in another format. I did just download them and extract, so they probably aren't corrupt at least.

        Yonik Seeley added a comment -

        Yeah, I saw that comment. In that case, removing the two synchronizations in the refactored methods other than addAndDelete is probably indicated. The one in addAndDelete was there originally, just within the IndexWriter try/finally, which is where the issue was since it'd go out and get a new searcher eventually.

        Maybe... but it's also worth noting that those comments (and my thinking around it) were from before (I think) index writers could be shared across reloads.

        It looks fine to me that you moved the ulog.add() outside of the writer get/release.

        Yonik Seeley added a comment -

        Let me see how repeatable a deadlock is on my system, then we can look into removing some of the extra sync...

        Erick Erickson added a comment - - edited

        bq: It looks fine to me that you moved the ulog.add() outside of the writer get/release.

        That's exactly what makes me nervous. But having it within the get/release is where the problem was.

        Let me know if there's anything you'd like to chat about.

        And it wasn't predictably repeatable (of course). Sometimes the script I attached would run 10-15 iterations OK then barf. Sometimes faster. Never more than 50. With the changes I ran with 400-500 iterations without a problem.

        Since I can get it to fail reasonably predictably though, let me know if you want me to run any changes locally.

        Mark Miller added a comment -

        It looks fine to me that you moved the ulog.add() outside of the writer get/release.

        Doh, that is my problem - I only reverted the first commit to DirectUpdateHandler2.

        Mark Miller added a comment -

        That's exactly what makes me nervous.

        That should make you nervous - the random test failures around the transaction log tests afterwards should be more in the terrified range though

        Yonik Seeley added a comment -

        Erick, I just now ran across this email from you 4 days ago:

        Backed out SOLR-7836 changes and TestStressReorder doesn't fail in 100
        iterations, failed 16 times in 100 iterations with changes.
        Digging....

        Is this still the case (that TestStressReorder now sometimes fails)?

        Erick Erickson added a comment -

        bq: Is this still the case (that TestStressReorder now sometimes fails?)

        Yonik Seeley It better not be still failing or I'll have to fall on my sword. I'm pretty sure I fixed that up; I can't imagine that I would have just checked the changes in, since TestStressReorder was what I ripped off to make my new test to avoid that deadlock, and the link between the two is pretty obvious, even to me

        I can't do it right now, but I'll run TestStressReorder a lot this evening to be sure.

        Yonik Seeley added a comment -

        OK, so it wasn't too hard for me to replicate the deadlocks w/ the latest commits backed out, and the traces are similar to what was posted before... it's getIndexWriter fighting with newIndexWriter.

          2> "WRITER10" ID=28 TIMED_WAITING on java.lang.Object@d8cdd45
          2>    at java.lang.Object.wait(Native Method)
          2>    - timed waiting on java.lang.Object@d8cdd45
          2>    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:96)
          2>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1588)
          2>    at org.apache.solr.update.UpdateLog.add(UpdateLog.java:455)
          2>    - locked org.apache.solr.update.UpdateLog@1df2955d
          2>    at org.apache.solr.update.DirectUpdateHandler2.addAndDelete(DirectUpdateHandler2.java:452)
        [...]
          2> "TEST-TestReloadDeadlock.testReloadDeadlock-seed#[D13A45EBBFA304C4]" ID=12 TIMED_WAITING on java.lang.Object@d8cdd45
          2>    at java.lang.Object.wait(Native Method)
          2>    - timed waiting on java.lang.Object@d8cdd45
          2>    at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:158)
          2>    - locked org.apache.solr.update.DefaultSolrCoreState@7d338874
          2>    at org.apache.solr.core.SolrCore.reload(SolrCore.java:479)
          2>    at org.apache.solr.core.CoreContainer.reload(CoreContainer.java:830)
          2>    at org.apache.solr.search.TestReloadDeadlock.testReloadDeadlock(TestReloadDeadlock.java:182)
        

        On straight trunk, I still get failures (just not deadlocks):

          2> 3406 ERROR (WRITER6) [    ] o.a.s.c.SolrCore org.apache.solr.common.SolrException: Error opening new searcher
          2>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1673)
          2>    at org.apache.solr.core.SolrCore.getRealtimeSearcher(SolrCore.java:1530)
          2>    at org.apache.solr.update.VersionInfo.getVersionFromIndex(VersionInfo.java:202)
          2>    at org.apache.solr.update.UpdateLog.lookupVersion(UpdateLog.java:783)
          2>    at org.apache.solr.update.VersionInfo.lookupVersion(VersionInfo.java:195)
          2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1088)
          2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:705)
          2>    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:104)
          2>    at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.handleAdds(JsonLoader.java:470)
          2>    at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:134)
          2>    at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:113)
          2>    at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:76)
          2>    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
          2>    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
          2>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:151)
          2>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2079)
          2>    at org.apache.solr.servlet.DirectSolrConnection.request(DirectSolrConnection.java:131)
          2>    at org.apache.solr.SolrTestCaseJ4.updateJ(SolrTestCaseJ4.java:1104)
          2>    at org.apache.solr.SolrTestCaseJ4.addAndGetVersion(SolrTestCaseJ4.java:1250)
          2>    at org.apache.solr.search.TestReloadDeadlock.addDoc(TestReloadDeadlock.java:200)
          2>    at org.apache.solr.search.TestReloadDeadlock.access$100(TestReloadDeadlock.java:46)
          2>    at org.apache.solr.search.TestReloadDeadlock$1.run(TestReloadDeadlock.java:156)
          2> Caused by: java.lang.NullPointerException
          2>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1642)
          2>    ... 21 more
        
        Mark Miller added a comment -

        and the traces are similar to what was posted before

        There is not a third thread that has the Writer out? Both of those traces line up with another thread having the writer. I don't understand why they would deadlock each other without another thread involved.

        I'll take a closer look now that I can duplicate it. I still suspect the proper fix for this is in SolrCoreState.

        Erick Erickson added a comment -

        I ran TestStressReorder yesterday with the changes and didn't get it to fail in 50 runs on my laptop. Then last night I kicked off a 1,000-repetition run on my home machine and... didn't remember to look this morning before I left home.

        So I'm pretty sure that TestStressReorder is OK with the latest patch; I'll look tonight to be sure.

        Yonik Seeley I saw what I think is the same failure on the dev list from a Jenkins build last night and AFAIK the changes are still in there.

        Yonik Seeley added a comment -

        There is not a third thread that has the Writer out?

        I just checked again... didn't see another.

        Both of those traces line up with another thread having the writer. I don't understand why they would deadlock each other without another thread involved.

        The first thread has the writer? I have never really reviewed SolrCoreState in depth though... so I'm still not really sure.

        Yonik Seeley added a comment -

        Here's the scenario w/ the old code.

        thread1 does an add:
        1) in DUH2, calls coreState.getIndexWriter()

        • this increments the ref count on the writer and sets coreState.writerFree=false

        2) calls UpdateLog.add

        thread2 calls core reload():
        3) calls coreState.newIndexWriter()

        • coreState.pauseWriter is set to true, so no new references will be handed out
        • goes into a loop waiting for writerFree=true (i.e., for all other references that were handed out to be returned)

        thread1:
        4) UpdateLog.add continues and indirectly causes coreState.getIndexWriter() to be called

        • sees coreState.pauseWriter set to true and thus goes into the wait loop

        So: one can't call getIndexWriter() and then do anything else that will eventually call getIndexWriter() or newIndexWriter()
        If we keep that restriction, then moving the ulog.add outside of the getIndexWriter/release block was correct.
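
        To make the interplay concrete, here is a minimal sketch of the pauseWriter/writerFree handshake described above. It is not the actual DefaultSolrCoreState code (the real class reference-counts the writer rather than using a single boolean), but the field names mirror it:

        class CoreStateSketch {
          private final Object writerPauseLock = new Object();
          private boolean pauseWriter = false; // set by newIndexWriter() so no new references are handed out
          private boolean writerFree = true;   // false while some thread holds a writer reference

          // steps 1 and 4: hand out the writer, but only while nothing has paused it
          Object getIndexWriter() throws InterruptedException {
            synchronized (writerPauseLock) {
              while (pauseWriter) {
                writerPauseLock.wait(); // step 4 parks here forever: the waiting thread itself
              }                         // still holds the reference that keeps writerFree == false
              writerFree = false;
              return new Object();      // stand-in for the RefCounted<IndexWriter>
            }
          }

          void releaseIndexWriter() {
            synchronized (writerPauseLock) {
              writerFree = true;
              writerPauseLock.notifyAll();
            }
          }

          // step 3: pause new references, then wait for the outstanding ones to be returned
          void newIndexWriter() throws InterruptedException {
            synchronized (writerPauseLock) {
              pauseWriter = true;
              while (!writerFree) {
                writerPauseLock.wait();
              }
              // ...close the old writer and open its replacement here...
              pauseWriter = false;
              writerPauseLock.notifyAll();
            }
          }
        }

        With this shape, a thread that already holds a writer reference and then (directly or indirectly) calls getIndexWriter() again can never wake up once a reload has set pauseWriter - which is exactly the restriction stated above.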

        Don't know about the other changes... the extra sync added in DUH2 does still seem unnecessary. And I haven't looked at what changes were made to SolrCoreState yet.

        Erick Erickson added a comment -

        Yonik Seeley Glad to see you're seeing the same thing I saw; senility is not here yet!

        I looked at TestStressReorder when I got back home tonight and 1,000 iterations went through fine. Whew!

        I'll remove the extra synch in DUH2 and run the tests tonight to see what happens, along with some additional debugging in case we hit the null pointer exception again. Not a clue what's happening there.

        Yonik Seeley added a comment -

        The NPE failures are likely unrelated to any changes made thus far... I think it's probably related to the fact that "uhandler" can abruptly change in the UpdateLog on a reload, and can be used perhaps before it's ready? I haven't gone down that rabbit hole...

        Yonik Seeley added a comment -

        I ran trunk for a while with the extra sync removed on the normal "add" case only, not the "addDelete" case.
        I hit another deadlock.
        This looks like the case Mark was looking for before... with a different thread holding the writer.

          2> "WRITER5" ID=23 BLOCKED on java.lang.Object@5764facb owned by "WRITER4" ID=22
          2>    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:593)
          2>    - blocked on java.lang.Object@5764facb
          2>    at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
          2>    at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
          2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1641)
          2>    at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1618)
          2>    at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161)
          2>    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270)
        [...]
          2> "WRITER4" ID=22 TIMED_WAITING on java.lang.Object@5a94059a
          2>    at java.lang.Object.wait(Native Method)
          2>    - timed waiting on java.lang.Object@5a94059a
          2>    at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:96)
          2>    at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1588)
          2>    at org.apache.solr.update.UpdateLog.add(UpdateLog.java:455)
          2>    - locked org.apache.solr.update.UpdateLog@5911991e
          2>    at org.apache.solr.update.DirectUpdateHandler2.addAndDelete(DirectUpdateHandler2.java:331)
          2>    - locked java.lang.Object@5764facb
          2>    at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:200)
          2>    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:164)
          2>    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        [...]
          2> "TEST-TestReloadDeadlock.testReloadDeadlock-seed#[D4E455E167793E1]" ID=12 TIMED_WAITING on java.lang.Object@5a94059a
          2>    at java.lang.Object.wait(Native Method)
          2>    - timed waiting on java.lang.Object@5a94059a
          2>    at org.apache.solr.update.DefaultSolrCoreState.newIndexWriter(DefaultSolrCoreState.java:156)
          2>    - locked org.apache.solr.update.DefaultSolrCoreState@1b88614d
          2>    at org.apache.solr.core.SolrCore.reload(SolrCore.java:479)
        

        Writer5 wants to do a commit

        • calls solrCoreState.getIndexWriter()
        • blocks waiting for updateLock

        Writer4 wants to do an addAndDelete

        • acquires updateLock
        • acquires UpdateLog.this
        • calls DefaultSolrCoreState.getIndexWriter and waits forever

        Main-test-thread wants to do a reload:

        • calls DefaultSolrCoreState.newIndexWriter and waits forever

        It feels like this type of deadlock can still be hit on current, unmodified trunk?
        Perhaps the right solution was to just pass the IndexWriter down once you acquire it.

        Yonik Seeley added a comment -

        Passing the indexWriter down would also fix the NPEs I think since then openReader wouldn't try to obtain it from the yet-to-be-finished new update handler?

        Yonik Seeley added a comment -

        OK, I let the current unmodified trunk run overnight.
        I got 14 fails, 4 of which were deadlocks.

        Here are the only changes I made to the test:

        Index: solr/core/src/test/org/apache/solr/search/TestReloadDeadlock.java
        ===================================================================
        --- solr/core/src/test/org/apache/solr/search/TestReloadDeadlock.java	(revision 1695727)
        +++ solr/core/src/test/org/apache/solr/search/TestReloadDeadlock.java	(working copy)
        @@ -74,7 +74,7 @@
             int nWriteThreads = 5 + random().nextInt(10);
         
             // query variables
        -    final AtomicLong reloads = new AtomicLong(50);  // number of reloads. Increase this number to force failure.
        +    final AtomicLong reloads = new AtomicLong(200);  // number of reloads. Increase this number to force failure.
         
             ifVerbose("commitPercent", commitPercent, "deleteByQueryPercent", deleteByQueryPercent
                 , "ndocs", ndocs, "nWriteThreads", nWriteThreads, "reloads", reloads);
        @@ -177,7 +177,7 @@
         
             // The reload operation really doesn't need to happen from multiple threads, we just want it firing pretty often.
             while (reloads.get() > 0) {
        -      Thread.sleep(10 + random().nextInt(250));
        +      Thread.sleep(1 + random().nextInt(50));
               reloads.decrementAndGet();
               h.getCoreContainer().reload("collection1");
             }
        
        Mark Miller added a comment -

        So: one can't call getIndexWriter() and then do anything else that will eventually call getIndexWriter() or newIndexWriter()
        If we keep that restriction, then moving the ulog.add outside of the getIndexWriter/release block was correct.

        Right - the expected use is to be careful and quickly get and release the writer so that you don't accidentally try to call getIndexWriter or newIndexWriter in between. This is why I thought Erick's first fix attempt sounded right. But then I guess he started hitting the deadlock you are still getting and figured it was not correct. Seems like it is correct, but we also have to address this other issue.
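
        As a sketch of that discipline (the class and method names here are made up for illustration; SolrCoreState.getIndexWriter and RefCounted.decref are the calls already referenced in this issue), the intended usage pattern looks roughly like:

        import java.io.IOException;

        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.Term;
        import org.apache.solr.core.SolrCore;
        import org.apache.solr.update.SolrCoreState;
        import org.apache.solr.util.RefCounted;

        class WriterUsageSketch {
          // Illustrative only: grab the writer, use it briefly, release it in a finally block,
          // and never call anything in between that could re-enter getIndexWriter()/newIndexWriter().
          static void addDocQuickly(SolrCoreState coreState, SolrCore core, Term idTerm, Document doc)
              throws IOException {
            RefCounted<IndexWriter> iwRef = coreState.getIndexWriter(core);
            try {
              IndexWriter writer = iwRef.get();
              writer.updateDocument(idTerm, doc);
              // Nothing here may (directly or indirectly, e.g. via ulog.add -> openNewSearcher)
              // ask the core state for the writer again.
            } finally {
              iwRef.decref(); // release promptly so a reload's newIndexWriter() can proceed
            }
          }
        }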

        Mark Miller added a comment -

        Perhaps the right solution was to just pass the IndexWriter down once you acquire it.

        Passing the indexWriter down would also fix the NPEs I think since then openReader wouldn't try to obtain it from the yet-to-be-finished new update handler?

        Wasn't that Erick's first fix attempt?

        I think the solution is to just pass the IndexWriter down to addAndDelete, but won't have time to really look until this evening.

        Yonik Seeley added a comment -

        Wasn't that Erick's first fix attempt?

        Yeah, I was essentially saying that seems to be the correct approach.

        Mark Miller added a comment -

        Hmm, I wonder what was still not working about it. I should be able to duplicate this too now; I'll take a look at the stack traces with just that change.

        Yonik Seeley added a comment -

        At some point in time, it seems like everyone develops some sort of script to run a test over and over again...
        I really should have put this up years ago:
        https://github.com/yonik/misc/blob/master/scripts/test.sh

        Erick Erickson added a comment - - edited

        Mark Miller If you check out trunk revision 1694809, then the stack trace above in deadlock_5_pass_iw.res.zip is the problem I hit by passing the IndexWriter down to DUH2.addAndDelete. It really doesn't help because UpdateLog.add has this bit of code:

                try {
                  RefCounted e1 = this.uhandler.core.openNewSearcher(true, true);
                  e1.decref();
                } catch (Exception var8) {
                  SolrException.log(log, "Error opening realtime searcher for deleteByQuery", var8);
                }
        

        so the IW passed to addAndDelete in DUH2 doesn't make a difference; the fact that the IW is held open is what blocks the call to openNewSearcher above. It doesn't matter whether addAndDelete has an already-opened IW passed in or opens its own: as long as the IW is held open across the call to ulog.add(), a deadlock occurs. Sometimes.

        Based on the exception message, would a better approach be to open up the new searcher in delete by query instead? Would that cover all the cases? I'll look a little here. Then I could move all the ulog.add()s back inside the associated IW and again tightly couple the document adds with updating the ulog.

        Or maybe the answer is making realtime searchers sensitive to whether anything's been indexed since the last time one was opened, rather than pre-opening a new one in addDoc0(). I don't even know if that's possible; I haven't looked yet. Essentially this would make opening a realtime searcher "lazy", which I can argue is a better solution than adding this overhead to all updates, although it also seems it would make realtime gets vary a lot depending on whether they had to open a searcher or not.

        Erick Erickson added a comment -

        Poking a little more, opening a new searcher in add happens only when clearCaches==true, which only happens explicitly in DUH2.addAndDelete which is where all this started. There's also a call in the CDCR code that passes a variable in, but I don't think that's really relevant.

        It's simple enough to move opening a new searcher up to these two places; I'll give it a try to evaluate. I don't like that solution much since it's trappy: a new call to add(cmd, true) that fails to open a new searcher could re-introduce the problem that opening the searcher where it's done now is designed to prevent. I suppose a big fat warning is in order?

        Let me try it just to see whether it cures things or not. I'm pretty sure it'll cure the deadlock problem. I'll first try to just comment out the openSearcher and see if I can blow up the realtime get tests, then move the open out and see if either the realtime get tests or the new deadlock test fail with the reorganized code. When I collect that data we can discuss some more. Probably have something later today.

        Yonik Seeley Those numbers in the new test were chosen completely arbitrarily; I'm guessing that the point of your changes is to drive the failure more often without lengthening the time the test takes, so I'll incorporate them.

        Erick Erickson added a comment -

        This patch should be applied after SOLR-7836.patch if anyone is back-porting

        Here's a new patch for comment that

        • puts the ulog writes back inside the IW blocks
        • pulls out the problematic open searcher in ulog.add to a separate method.
        • calls the extracted method from the two places that could call UpdateLog.add(cmd, true), which was the condition for opening a new searcher. The calls to the new method must be outside the IW block.
        • removes the extra synchronized blocks on solrCoreState.getUpdateLock()
        • changes the test to hit this condition harder as per Yonik.

        It's possible that the CDCR code calls ulog.add with clearCaches==true, in which case the extracted method in ulog is called. Frankly I doubt that's a necessary thing, but seems harmless.

        I don't particularly like decoupling the open searcher from the updatelog.add, but I like lockups even less. Not to mention possible tlog craziness. So I'll live with my dislike.

        I think this addresses concerns about the tlog synchronization.

        I ran this last night for 360 iterations, then made some trivial changes (yeah, right). I'll try some beasting on this today plus StressTestReorder, then do the usual precommit and full test. Assuming all that goes well I'll probably check this in tomorrow and call this done unless there are objections.

        This, coupled with Yonik's changes for the NPE, should put this to bed.

        Mark Miller Yonik Seeley all comments welcome of course.

        Yonik Seeley added a comment -

        pulls out the problematic open searcher in ulog.add to a separate method.

        There are a few areas with complex synchronization that should not be changed unless one is confident about understanding why all the synchronization was there in the first place. Having the tests pass isn't a high enough bar for these areas because of the difficulty in actually getting a test to expose subtle race conditions or thread safety issues. This comes back to my original "get it back in my head" - I don't feel comfortable messing with this stuff either until I've really internalized the bigger picture again... and it doesn't last.

        For the specific case above, one can't just take what was one synchronized block and break it up into two. It certainly creates race conditions and breaks the invariants we try to keep. The specific invariant here is that if it's not in the tlog maps, then it is guaranteed to be in the realtime reader. Hopefully some of our tests would fail with this latest patch... but it's hard stuff to test.

        I worked up a patch that passed down the IndexWriter (it needs to be passed all the way down to SolrCore.openSearcher to actually avoid deadlocks). That ended up changing more code than I'd like... so now I'm working up a patch to make IW locking re-entrant. That approach should be less fragile going forward (i.e. less likely to easily introduce a deadlock through seemingly unrelated changes).

        Yonik Seeley added a comment -

        I tried applying the last patch and running TestStressReorder and luckily it does fail often for me.

        Erick Erickson added a comment -

        bq: I tried applying the last patch and running TestStressReorder and luckily it does fail often for me.

        Yep, I saw that last night. I looked a bit at whether it was a test artifact and apparently it's not so I was going to dive into that today.

        Anyway, since you're working up alternatives, I'll leave it to you. The current checkin (not the latest patch which I won't commit) at least avoids the deadlock that started me down this path in the first place. Whether it creates other issues is, of course, the $64K question. Let me know if there's anything I can do besides cheer you on.

        Yonik Seeley added a comment - - edited

        OK, here's a patch that uses a reentrant read-write lock in DefaultSolrCoreState.

        Notes:

        • no more pauseWriter / writerFree variables
        • readLock is for grabbing the current writer, writeLock is for ensuring there are no readers (a rough sketch follows this list)
        • the RefCounted<IndexWriter> doesn't do much any more... decRef releases the readLock; that's the only important thing. I just kept it to keep the API the same for now.
        • newIndexWriter, closeIndexWriter, and openIndexWriter all delegate to changeIndexWriter to remove duplicated code. This does change the semantics of some of these methods, but I hope for the better?
          • The old openIndexWriter method only synchronized on writerPauseLock, so if there were any readers and writers in the lock loop, or even readers open, this method could go in and still change the writer anyway. Sharing the code in changeWriter now means that it would wait for any readers to finish (which looking at the comments implies there should not be anyway? and if there were, new IW creation would fail), close any existing old writer, and then open the new one.
        • I reordered some things so that if we got an exception while opening a new writer, the writer reference would already be cleared so we won't simply default to using the old (presumably now closed) writer, but will try to open again. Should be more resilient in the face of a spurious exception.
        • it's really not clear if polling is still needed in getIndexWriter, but I kept it for now just to be safe.
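
        A rough sketch of that shape (not the attached patch itself - names are simplified and the writer is just a placeholder):

        import java.util.concurrent.locks.ReentrantReadWriteLock;

        class RWCoreStateSketch {
          private final ReentrantReadWriteLock iwLock = new ReentrantReadWriteLock();
          private volatile Object indexWriter; // placeholder for the real IndexWriter

          // Analogous to getIndexWriter(): readers share access to the current writer.
          // Because the lock is reentrant, a thread already holding the read lock can
          // re-acquire it (e.g. ulog.add -> openNewSearcher) without deadlocking itself.
          Object acquireWriter() {
            iwLock.readLock().lock();
            return indexWriter;
          }

          // Analogous to RefCounted.decRef(): releasing the read lock is the only side effect left.
          void releaseWriter() {
            iwLock.readLock().unlock();
          }

          // Analogous to changeWriter(): newIndexWriter/closeIndexWriter/openIndexWriter funnel
          // through here; the write lock blocks until no readers remain.
          void changeWriter() {
            iwLock.writeLock().lock();
            try {
              indexWriter = null;         // clear first so a failed open doesn't leave a closed writer in place
              indexWriter = new Object(); // open the replacement writer here
            } finally {
              iwLock.writeLock().unlock();
            }
          }
        }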

        looping some tests now...

        Erick Erickson added a comment -

        I'll beast TestReloadDeadlock and TestStressReorder for a while as well.

        Thanks!

        Erick Erickson added a comment -

        FWIW, 96 runs each (look, 96 divides by 6 processors evenly, OK?) and both TestStressReorder and TestReloadDeadlock seem happy, along with precommit. Running the full test suite now.

        Erick Erickson added a comment -

        Yonik Seeley Since you've done the heavy lifting here, and I've beasted both of the tests and a full test run also worked, do you want me to just go ahead and check it in? Or do you want to polish it further first?

        Yonik Seeley added a comment -

        I'm running some more tests and will do a little more polishing on comments...

        Yonik Seeley added a comment -

        I've been running ChaosMonkeySafeLeaderTest for about 3 days with my test script that also searches for corrupt indexes or assertion failures even when the test still passes.
        Current trunk (as of last week): 9 corrupt indexes
        Patched trunk: 14 corrupt indexes and 2 test failures (inconsistent shards)

        The corrupt indexes may not be a problem; I don't really know. We kill off servers, perhaps during replication? That seems like it could produce corrupt indexes, but I don't know if that's the actual scenario. An increase in their incidence doesn't necessarily point to a problem either. But inconsistent shards versus none... that does seem like a problem if it holds.

        I've reviewed the locking code again, and it looks solid, so I'm not sure what's going on.

        Here's a typical corrupt index trace:

          2> 21946 WARN  (RecoveryThread-collection1) [n:127.0.0.1:51815_ c:collection1 s:shard1 r:core_node2 x:collection1] o.a.s.h.IndexFetcher Could not retrieve checksum from file.
          2> org.apache.lucene.index.CorruptIndexException: codec footer mismatch (file truncated?): actual footer=1698720114 vs expected footer=-1071082520 (resource=MMapIndexInput(path="/opt/code/lusolr_clean2/solr/build/solr-core/test/J0/temp/solr.cloud.ChaosMonkeySafeLeaderTest_B7DC9C42462BF20D-001/shard-2-001/cores/collection1/data/index/_0.fdt"))
          2>    at org.apache.lucene.codecs.CodecUtil.validateFooter(CodecUtil.java:416)
          2>    at org.apache.lucene.codecs.CodecUtil.retrieveChecksum(CodecUtil.java:401)
          2>    at org.apache.solr.handler.IndexFetcher.compareFile(IndexFetcher.java:876)
          2>    at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:839)
          2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:437)
          2>    at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:265)
          2>    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:382)
          2>    at org.apache.solr.cloud.RecoveryStrategy.replicate(RecoveryStrategy.java:162)
          2>    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:437)
          2>    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:227)
        
        Yonik Seeley added a comment -

        I'm trying some different variants of my patch to pin down exactly which change in semantics is causing the failures. It takes some time though... I normally need to let it loop for a day to get failures.

        Yonik Seeley added a comment -

        OK, I found the issue.
        When I rewrote the SolrCoreState locking to use reentrant read-write locks, I was using the current trunk as a reference. Unfortunately, a previous commit on this issue had introduced a bug by changing the semantics, and I didn't realize it until I went back and looked at trunk as it was before any commits for this issue.

        The bug was the addition of this to closeIndexWriter:

             } finally {
                pauseWriter = false;
                writerPauseLock.notifyAll();
              }
        

        closeIndexWriter / newIndexWriter are meant to be used as a pair, and to block any readers in between the two calls.
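
        For context, here is a condensed sketch of that intended pairing under the old pause-based scheme. The class and the simplified bodies are hypothetical (they only loosely mirror the old field names, and a plain Closeable stands in for the IndexWriter); the committed fix replaces the pause flag with the read-write lock, but the contract is the same: readers must stay blocked from closeIndexWriter until newIndexWriter has installed the replacement.

        import java.io.Closeable;
        import java.io.IOException;

        // Hypothetical sketch -- not the actual Solr code.
        class PausedWriterSketch {
          private final Object writerPauseLock = new Object();
          private boolean pauseWriter = false;
          private Closeable indexWriter;              // stands in for the IndexWriter

          void closeIndexWriter() throws IOException {
            synchronized (writerPauseLock) {
              pauseWriter = true;                     // block getIndexWriter() callers
              if (indexWriter != null) {
                indexWriter.close();
                indexWriter = null;
              }
              // NOTE: no "pauseWriter = false; writerPauseLock.notifyAll();" here.
              // Adding that in a finally block (the bug quoted above) let readers
              // sneak in before newIndexWriter() had opened the replacement writer.
            }
          }

          void newIndexWriter(Closeable freshWriter) {
            synchronized (writerPauseLock) {
              try {
                indexWriter = freshWriter;            // the real code creates the writer here
              } finally {
                pauseWriter = false;                  // only now may readers proceed
                writerPauseLock.notifyAll();
              }
            }
          }

          Closeable getIndexWriter() throws InterruptedException {
            synchronized (writerPauseLock) {
              while (pauseWriter) {                   // readers wait out the close/open pair
                writerPauseLock.wait(100);
              }
              return indexWriter;
            }
          }
        }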

        I've been looping ChaosMonkeySafeLeaderTest since yesterday afternoon and no fails yet.

        Mark Miller added a comment -

        The bug was the addition of this to closeIndexWriter:

        Ouch, lucky to notice that. It shows how careful we have to be when touching update-side synchronization. Quick commits can easily get lost and introduce hard-to-track-down bugs.

        Erick Erickson added a comment -

        Hmmm, looks like I screwed the pooch on that one. Siiiggggghhhhh.

        Erick Erickson added a comment -

        Worth merging into 5.3 in case a 5.3.1 gets cut?

        ASF subversion and git services added a comment -

        Commit 1701043 from Yonik Seeley in branch 'dev/trunk'
        [ https://svn.apache.org/r1701043 ]

        SOLR-7836: change SolrCoreState to use a reentrant locking implementation to fix deadlocks

        ASF subversion and git services added a comment -

        Commit 1701044 from Yonik Seeley in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1701044 ]

        SOLR-7836: change SolrCoreState to use a reentrant locking implementation to fix deadlocks

        Yonik Seeley added a comment -

        Not sure if there will even be a 5.3.1 release.
        If there is, perhaps everything that was committed as part of this issue should be back-ported together? It would be nice to let it bake a while longer in 5x/trunk first, just to make sure nothing else is broken.

        Erick Erickson added a comment -

        Yonik Seeley: re: back-porting to 5.3.1. Never mind...

        I just checked to be sure. Noble cut the 5.3 branch three days before my first check-in for this ticket. If the problems you found with that check-in had made it into the release, a back-port would have been in order. But since they didn't, I see no need.


          People

          • Assignee:
            Yonik Seeley
          • Reporter:
            Erick Erickson
          • Votes:
            0
          • Watchers:
            4
