Uploaded image for project: 'Solr'
  1. Solr
  2. SOLR-15045

2x latency of synchronous commits due to serial execution on local and distributed leaders

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 8.5.2
    • Fix Version/s: None
    • Component/s: SolrCloud
    • Labels:
      None
    • Environment:

      Operating system: Linux (centos 7.7.1908)

      Description

      Hi All,

      When we issue commit through curl command, not all the shards are getting `start commit` requests at the same time.

      Solr Setup Detail : (Running in solrCloud mode)
      It has 6 shards, and each shard has only one replica (which is also a
      leader) and the replica type is NRT.
      Each shards are hosted on the separate physical host.

      Zookeeper => We are using external zookeeper ensemble (3 separate node
      cluster)

      Shard and Host name
      shard1_0=>solr_199
      shard1_1=>solr_200
      shard2_0=> solr_254
      shard2_1=> solr_132
      shard3_0=>solr_133
      shard3_1=>solr_198

      Request rate on the system is currently zero and only hourly indexing
      running on it.

      We are using curl command to issue commit.

      curl
      "http://solr_254:8389/solr/my_collection/update?openSearcher=true&commit=true&wt=json"

      (Using solr_254 host to issue commit)

      On using the above command all the shards have started processing commit (i.e
      getting `start commit` request) except the one used in curl command (i.e
      shard2_0 which is hosted on solr_254). Individually each shards takes around
      10 to 12 min to process hard commit (most of this time is spent on reloading
      external files).
      As per logs, shard2_0 is getting `start commit` request after 10 minutes
      (approx). This leads to following timeout error.

      2020-12-06 18:47:47.013 ERROR
      org.apache.solr.client.solrj.SolrServerException: Timeout occured while
      waiting response from server at:
      http://solr_132:9744/solr/my_collection_shard2_1_replica_n21/update?update.distrib=TOLEADER&distrib.from=http%3A%2F%2Fsolr_254%3A9744%2Fsolr%2Fmy_collection_shard2_0_replica_n11%2F
            at
      org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:407)
            at
      org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:753)
            at
      org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.request(ConcurrentUpdateHttp2SolrClient.java:369)
            at
      org.apache.solr.client.solrj.SolrClient.request(SolrClient.java:1290)
            at
      org.apache.solr.update.SolrCmdDistributor.doRequest(SolrCmdDistributor.java:344)
            at
      org.apache.solr.update.SolrCmdDistributor.lambda$submit$0(SolrCmdDistributor.java:333)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at
      com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:180)
            at
      org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
            at
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at
      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
          Caused by: java.util.concurrent.TimeoutException
            at
      org.eclipse.jetty.client.util.InputStreamResponseListener.get(InputStreamResponseListener.java:216)
            at
      org.apache.solr.client.solrj.impl.Http2SolrClient.request(Http2SolrClient.java:398)
            ... 13 more

      Above timeout error is between solr_254 and solr_132. Similar errors are
      there between solr_254 and other 4 shards

      Since query load is zero, mostly CPU utilization is around 3%.
      After issuing curl commit command, CPU goes up to 14% on all shards except
      shard2_0 (host: solr_254, the one used in curl command).
      And after 10 minutes (i.e after getting the `start commit` request)  CPU  on
      shard2_0 also goes up to 14%.

      As I mentioned earlier each shards take around 10-12 mins to process commit
      and due to delay in starting commit process on one shard (shard2_0) our
      overall commit time is doubled now. (22-24 minutes approx).

      We are observing this delay in both hard and soft commit.

      In our solr-5.4.0(having similar setup), we use the similar curl command to issue commit, and there all the shards are getting `start commit` request at same time. Including the one used in curl command.

       

      Impact After deleting external files:

      In order to nullify the impact of external files, I had deleted external
      files from all the shards and issued commit through the curl command. Commit
      operation got completed in 3 seconds. Individual shards took 1.5 seconds to
      complete the commit operation. But there was a delay of around 1.5 seconds
      on the shard whose hostname was used to issue the commit. Hence overall
      commit time is 3 seconds.

      During this operation, there was no timeout or any other kind of error
      (except `external file not found` error which is expected).

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                raj.yadav1 Raj Yadav
              • Votes:
                0 Vote for this issue
                Watchers:
                5 Start watching this issue

                Dates

                • Created:
                  Updated:

                  Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 20m
                  20m