Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-17060

CoreContainer#create may deadlock with concurrent requests for metrics

    XMLWordPrintableJSON

Details

    Description

      CoreContainer#create registers metrics for the created core quite early, and metrics can be requested for a core that is still being created. If a metrics request calls SolrCore#getSearcher with unlucky timing (race condition), then CoreContainer#create can deadlock.

      This problem was described on the users lists: https://lists.apache.org/thread/mvpp1ogkxfdgfx87mdt6ylhqsttoq2dw

      We ran into such a deadlock with Solr 9.2.1, a CoreAdmin CREATE request, and a periodic thread that requests all available JMX metrics.

      A CoreAdmin CREATE request was received, but its thread waits forever because onDeckSearchers is 1 and _searcher is null (variables checked in heap dump):

      "qtp1800649922-28" #28 prio=5 os_prio=0 cpu=757.69ms elapsed=25383.63s tid=0x00007fe6dc9613a0 nid=0x70 in Object.wait()  [0x00007fe6c68ca000]
         java.lang.Thread.State: WAITING (on object monitor)
          at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
          - waiting on <no object reference available>
          at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
          at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
          - locked <0x00000000e25dd2c8> (a java.lang.Object)
          at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1283)
          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1168)
          at org.apache.solr.core.SolrCore.<init>(SolrCore.java:1051)
          at org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1666)
          at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1532)
          at org.apache.solr.handler.admin.CoreAdminOperation.lambda$static$0(CoreAdminOperation.java:111)
          at org.apache.solr.handler.admin.CoreAdminOperation$$Lambda$437/0x00007fe66048cc60.execute(Unknown Source)
          at org.apache.solr.handler.admin.CoreAdminOperation.execute(CoreAdminOperation.java:398)
          at org.apache.solr.handler.admin.CoreAdminHandler$CallInfo.call(CoreAdminHandler.java:354)
          at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:219)
      

       
      There's no way for clients to retry the CREATE request, because CoreContainer rejects additional requests for good reason ("Already creating a core with name ...", see SOLR-14969).
      The only solution is to restart Solr.
       
      A thread for metrics was also waiting in the same way:

      prometheus-http-1-1" #30 daemon prio=5 os_prio=0 cpu=1212.68ms elapsed=25383.60s tid=0x00007fe620008ac0 nid=0x73 in Object.wait()  [0x00007fe6c680b000]
         java.lang.Thread.State: WAITING (on object monitor)
             at java.lang.Object.wait(java.base@17.0.8.1/Native  Method)
             - waiting on <no object reference available>
             at java.lang.Object.wait(java.base@17.0.8.1/Unknown  Source)
             at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2528)
             - locked <0x00000000e25dd2c8> (a java.lang.Object)
             at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2271)
             at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2106)
             at org.apache.solr.core.SolrCore.withSearcher(SolrCore.java:2124)
             at org.apache.solr.core.SolrCore.getSegmentCount(SolrCore.java:534)
             at org.apache.solr.core.SolrCore.lambda$initializeMetrics$11(SolrCore.java:1360)
             at org.apache.solr.core.SolrCore$$Lambda$760/0x00007fe660634ec8.getValue(Unknown Source)
             at org.apache.solr.metrics.SolrMetricManager$GaugeWrapper.getValue(SolrMetricManager.java:779)
             at org.apache.solr.metrics.reporters.jmx.JmxMetricsReporter$JmxGauge.getValue(JmxMetricsReporter.java:207
      

      This thread gathers multiple metrics one after the other. We can assume, that it had called SolrCore#getSearcher successfully before, and that the previous call has incremented onDeckSearchers to 1.

      A previous call to #getSearcher should eventually decrement onDeckSearchers again, and also set the _searcher field, but that's not happening. Normally, this should happen in SolrCore#registerSearcher, but that method is executed on the searcherExecutor, see
      https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L2681

      The problem is, that searcherExecutor will never execute #registerSearcher because the executor is still blocked by another job. This is because the core is still being created, see
      https://github.com/apache/solr/blob/releases/solr/9.2.1/solr/core/src/java/org/apache/solr/core/SolrCore.java#L1160

      "searcherExecutor-14-thread-1-processing-studio" #50 prio=5 os_prio=0 cpu=0.27ms elapsed=25212.18s tid=0x00007fe608146080 nid=0xbd waiting on condition  [0x00007fe6c637a000]
         java.lang.Thread.State: WAITING (parking)
             at jdk.internal.misc.Unsafe.park(java.base@17.0.8.1/Native  Method)
             - parking to wait for  <0x00000000e738c1a8> (a java.util.concurrent.CountDownLatch$Sync)
             at java.util.concurrent.locks.LockSupport.park(java.base@17.0.8.1/Unknown  Source)
             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@17.0.8.1/Unknown  Source)
             at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@17.0.8.1/Unknown  Source)
             at java.util.concurrent.CountDownLatch.await(java.base@17.0.8.1/Unknown  Source)
             at org.apache.solr.core.SolrCore.lambda$new$3(SolrCore.java:1162)
             at org.apache.solr.core.SolrCore$$Lambda$815/0x00007fe6606f3c40.call(Unknown Source)
             at java.util.concurrent.FutureTask.run(java.base@17.0.8.1/Unknown  Source) 
      

      In summary, we have

      • thread "qtp1800649922-28" in SolrCore.<init> trying to create a core, which is waiting for
      • #registerSearcher job in the searcherExecutor work queue, which is waiting to be executed by
      • thread "searcherExecutor-14-thread-1-processing-studio" which waits for the first thread "qtp1800649922-28" to unblock it

      Attachments

        Issue Links

          Activity

            People

              stillalex Alex Deparvu
              ahubold Andreas Hubold
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h 20m
                  3h 20m