Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-17118

Solr deadlock during servlet container start

Details

    Description

      In rare cases, Solr can run into a deadlock when started. The servlet container startup thread gets blocked and there's no other thread that could unblock it:

      "main" #1 prio=5 os_prio=0 cpu=5922.39ms elapsed=7490.27s tid=0x00007f637402ae70 nid=0x47 waiting on condition [0x00007f6379488000]
         java.lang.Thread.State: WAITING (parking)
          at jdk.internal.misc.Unsafe.park(java.base@17.0.9/Native Method)
          - parking to wait for  <0x0000000081da8000> (a java.util.concurrent.CountDownLatch$Sync)
          at java.util.concurrent.locks.LockSupport.park(java.base@17.0.9/Unknown Source)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.9/Unknown Source)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.9/Unknown Source)
          at java.util.concurrent.CountDownLatch.await(java.base@17.0.9/Unknown Source)
          at org.apache.solr.servlet.CoreContainerProvider$ContextInitializationKey.waitForReadyService(CoreContainerProvider.java:523)
          at org.apache.solr.servlet.CoreContainerProvider$ServiceHolder.getService(CoreContainerProvider.java:562)
          at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:148)
          at org.eclipse.jetty.servlet.FilterHolder.initialize(FilterHolder.java:133)
          at org.eclipse.jetty.servlet.ServletHandler.lambda$initialize$2(ServletHandler.java:725)
          at org.eclipse.jetty.servlet.ServletHandler$$Lambda$315/0x00007f62fc2674b8.accept(Unknown Source)
          at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(java.base@17.0.9/Unknown Source)
          at java.util.stream.Streams$ConcatSpliterator.forEachRemaining(java.base@17.0.9/Unknown Source)
          at java.util.stream.ReferencePipeline$Head.forEach(java.base@17.0.9/Unknown Source)
          at org.eclipse.jetty.servlet.ServletHandler.initialize(ServletHandler.java:749)
          at org.eclipse.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:392) 
      

      ContextInitializationKey.waitForReadyService should have been unblocked by CoreContainerProvider#init, which is calling ServiceHolder#setService. This should work because CoreContainerProvider#init is always called before SolrDispatchFilter#init (ServletContextListeners are initialized before Filters).

      But there's a problem: CoreContainerProvider#init stores the ContextInitializationKey and the mapped ServiceHolder in CoreContainerProvider#services, and that's a WeakHashMap:

            services 
                .computeIfAbsent(new ContextInitializationKey(servletContext), ServiceHolder::new) 
                .setService(this); 
      

      The key is not referenced anywhere else, which makes the mapping a candidate for garbage collection. The ServiceHolder value also does not reference the key anymore, because #setService cleared the reference.

      With bad luck, the mapping is already gone from the WeakHashMap before SolrDispatchFilter#init tries to retrieve it with CoreContainerProvider#serviceForContext. And that method will then create a new ContextInitializationKey and ServiceHolder, which is then used for #waitForReadyService. But such a new ContextInitializationKey has never received a #makeReady call, and #waitForReadyService will block forever.

      Attachments

        Issue Links

          Activity

            ahubold Andreas Hubold added a comment -

            I'd think this is caused by changes for SOLR-15590.

            Maybe it's even the same problem as the one that's causing SOLR-16086?

            ahubold Andreas Hubold added a comment - I'd think this is caused by changes for SOLR-15590 . Maybe it's even the same problem as the one that's causing SOLR-16086 ?
            dsmiley David Smiley added a comment -

            I've been stumped a number of times over the years looking at an unreproducible test failure that shows SolrDispatchFilter.init calling CoreContainerProvider.waitForReadyService and waiting on a CoundDownLatch indefinitely. Meanwhile, no other trace or problems in the logs. Eventually the test will timeout and we see a thread dump.

            I suspect a timing bug of exactly when GC happens interplaying with the use of WeakHashMap. In particular I see ContextInitializationKey's constructor publishing "this" to the ServletContext which seems like a bad place to put such logic (constructors publishing themselves is suspicious in general; avoid it). But the point is that it'll overwrite an existing entry in the context that may very well be there, thus suddenly making an existing entry in a WeakHashMap weakly reachable and it may be removed. There is too much complexity there; I think it should be overhauled a bit.

            dsmiley David Smiley added a comment - I've been stumped a number of times over the years looking at an unreproducible test failure that shows SolrDispatchFilter.init calling CoreContainerProvider.waitForReadyService and waiting on a CoundDownLatch indefinitely. Meanwhile, no other trace or problems in the logs. Eventually the test will timeout and we see a thread dump. I suspect a timing bug of exactly when GC happens interplaying with the use of WeakHashMap. In particular I see ContextInitializationKey's constructor publishing "this" to the ServletContext which seems like a bad place to put such logic (constructors publishing themselves is suspicious in general; avoid it). But the point is that it'll overwrite an existing entry in the context that may very well be there, thus suddenly making an existing entry in a WeakHashMap weakly reachable and it may be removed. There is too much complexity there ; I think it should be overhauled a bit.
            gus Gus Heck added a comment -

            Sigh...

            I spent a bunch of time trying to eliminate deadlock/race stuff in this. Seems like something snuck through anyway. If it can be simplified certainly that's great, but IIRC (it's been a while) the complexity came from avoiding deadlocks, and from the mismatch between how our tests startup vs the way we actually start up as a server (the tests use JettySolrRunner to skip a bunch of standard web container stuff to speed up tests)

            There's also an outside chance that there's some timing difference these days, due to Jetty 10's move to EventListener, vs LifecycleListner but that's got no proof and is pure speculation.

            gus Gus Heck added a comment - Sigh... I spent a bunch of time trying to eliminate deadlock/race stuff in this. Seems like something snuck through anyway. If it can be simplified certainly that's great, but IIRC (it's been a while) the complexity came from avoiding deadlocks, and from the mismatch between how our tests startup vs the way we actually start up as a server (the tests use JettySolrRunner to skip a bunch of standard web container stuff to speed up tests) There's also an outside chance that there's some timing difference these days, due to Jetty 10's move to EventListener, vs LifecycleListner but that's got no proof and is pure speculation.
            dsmiley David Smiley added a comment -

            The problem strikes again. As long as this PR of mine seems pretty stable, I'm inclined to merge it to main and then monitor builds closely for spooky stuff. It's better than the current alternative.

            dsmiley David Smiley added a comment - The problem strikes again . As long as this PR of mine seems pretty stable, I'm inclined to merge it to main and then monitor builds closely for spooky stuff. It's better than the current alternative.

            Commit 1177796e32631c62d8f00e7df4341c92b75e1617 in solr's branch refs/heads/main from David Smiley
            [ https://gitbox.apache.org/repos/asf?p=solr.git;h=1177796e326 ]

            SOLR-17118: Simplify/fix CoreContainerProvider initialization (#2474)

            Thereby fixing a rare occurrence of Solr hanging on startup.

            CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization.

            JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way – add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff.

            HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed.

            jira-bot ASF subversion and git services added a comment - Commit 1177796e32631c62d8f00e7df4341c92b75e1617 in solr's branch refs/heads/main from David Smiley [ https://gitbox.apache.org/repos/asf?p=solr.git;h=1177796e326 ] SOLR-17118 : Simplify/fix CoreContainerProvider initialization (#2474) Thereby fixing a rare occurrence of Solr hanging on startup. CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization. JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way – add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff. HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed.

            Commit 53a2a36df2fb843df640ce51ead4a6e0f65d8e74 in solr's branch refs/heads/branch_9x from David Smiley
            [ https://gitbox.apache.org/repos/asf?p=solr.git;h=53a2a36df2f ]

            SOLR-17118: Simplify/fix CoreContainerProvider initialization (#2474)

            Thereby fixing a rare occurrence of Solr hanging on startup.

            CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization.

            JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way – add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff.

            HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed.

            (cherry picked from commit 1177796e32631c62d8f00e7df4341c92b75e1617)

            jira-bot ASF subversion and git services added a comment - Commit 53a2a36df2fb843df640ce51ead4a6e0f65d8e74 in solr's branch refs/heads/branch_9x from David Smiley [ https://gitbox.apache.org/repos/asf?p=solr.git;h=53a2a36df2f ] SOLR-17118 : Simplify/fix CoreContainerProvider initialization (#2474) Thereby fixing a rare occurrence of Solr hanging on startup. CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization. JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way – add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff. HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed. (cherry picked from commit 1177796e32631c62d8f00e7df4341c92b75e1617)

            Commit 53a2a36df2fb843df640ce51ead4a6e0f65d8e74 in solr's branch refs/heads/backport_SOLR-16842_to_branch_9x from David Smiley
            [ https://gitbox.apache.org/repos/asf?p=solr.git;h=53a2a36df2f ]

            SOLR-17118: Simplify/fix CoreContainerProvider initialization (#2474)

            Thereby fixing a rare occurrence of Solr hanging on startup.

            CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization.

            JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way – add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff.

            HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed.

            (cherry picked from commit 1177796e32631c62d8f00e7df4341c92b75e1617)

            jira-bot ASF subversion and git services added a comment - Commit 53a2a36df2fb843df640ce51ead4a6e0f65d8e74 in solr's branch refs/heads/backport_ SOLR-16842 _to_branch_9x from David Smiley [ https://gitbox.apache.org/repos/asf?p=solr.git;h=53a2a36df2f ] SOLR-17118 : Simplify/fix CoreContainerProvider initialization (#2474) Thereby fixing a rare occurrence of Solr hanging on startup. CoreContainerProvider: Don't need any CountDownLatch (x2), the synchronized WeakHashMap of "services", the ServiceHolder, the ContextInitializationKey. No looping to wait for initialization. JettySolrRunner: incorporate the CoreContainerProvider and various servlet filters in a normal way – add all this before starting, not after. Thus Jetty will shut them down properly so we don't have to. Removed some needless synchronized wait/notify and other needless stuff. HealthCheckHandlerTest was shutting down CoreContainer improperly; this subset of the test was removed. (cherry picked from commit 1177796e32631c62d8f00e7df4341c92b75e1617)
            anshum Anshum Gupta added a comment -

            Closing after the 9.7.0 release

            anshum Anshum Gupta added a comment - Closing after the 9.7.0 release

            People

              dsmiley David Smiley
              ahubold Andreas Hubold
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 50m
                  2h 50m