Solr / SOLR-17275

Major performance regression of CloudSolrClient in Solr 9.6.0 when using aliases

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 9.6
    • Fix Version/s: 9.6.1
    • Component/s: SolrJ
    • Labels: None
    • Environment: SolrJ 9.6.0, Ubuntu 22.04, Java 17

    Description

      After upgrading from SolrJ 9.5.0 to 9.6.0, I observe worse CloudSolrClient latency, especially at p99:

      p99 jumped from ~25 ms to ~400 ms
      p90 jumped from ~9.9 ms to ~22 ms
      p75 jumped from ~7 ms to ~11 ms
      p50 jumped from ~4.5 ms to ~7.5 ms

      Screenshot from Grafana (the new version was deployed at ~14:30):

      I took a thread dump and can see many threads blocked in ZkStateReader.forceUpdateCollection:

      Thread info: "suggest-solrThreadPool-thread-52" prio=5 Id=600 BLOCKED on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d owned by "suggest-solrThreadPool-thread-34" Id=582
      	at app//org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:506)
      	-  blocked on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d
      	at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getState(ZkClientClusterStateProvider.java:155)
      	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1207)
      	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1099)
      	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:892)
      	at app//org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:820)
      	at app//org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:255)
      	at app//org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:927)
      	...
      	Number of locked synchronizers = 1
      	- java.util.concurrent.ThreadPoolExecutor$Worker@1beb7ed3
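
The BLOCKED state in the dump means the threads are queuing on a single object monitor. The following is a minimal sketch of that effect, not Solr code: `SharedReader`, `forceUpdate`, and the 20 ms cost are hypothetical stand-ins chosen only for illustration. It shows how a `synchronized` refresh serializes concurrent callers, which is the kind of behaviour that would inflate tail latency.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class MonitorContentionSketch {

    // Hypothetical stand-in for a shared state reader; this is not the real ZkStateReader.
    static final class SharedReader {
        // The object's monitor guards the refresh, so only one caller runs it at a time.
        synchronized void forceUpdate(long costMillis) {
            try {
                Thread.sleep(costMillis); // assumed cost of one refresh round trip
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Submits `threads` concurrent callers and returns the wall-clock time until all finish. */
    static long totalMillis(int threads, long costMillis) {
        SharedReader reader = new SharedReader();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch done = new CountDownLatch(threads);
        long t0 = System.nanoTime();
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    reader.forceUpdate(costMillis);
                } finally {
                    done.countDown();
                }
            });
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - t0);
    }

    public static void main(String[] args) {
        // Eight callers at 20 ms each cannot overlap, so the total is at least about 160 ms.
        System.out.println("total ms: " + totalMillis(8, 20));
    }
}
```

Although every caller has its own thread, the last one to acquire the monitor waits for all the others first, so the slowest requests pay the cost of the whole queue.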
      

      At the same time, qTime reported by Solr hasn't changed, so I'm pretty sure it's a client-side regression.

      I tried reproducing it locally, and I can see forceUpdateCollection being called for every request in my application. This commit changed the logic in ZkClientClusterStateProvider.getState so that forceUpdateCollection is called whenever clusterState.getCollectionRef returns null. That was not the case in 9.5.0 (forceUpdateCollection was not called in this place). I can see in the debugger that getCollectionRef only supports collections, not aliases (the collectionStates map contains only collections). In my application all collections are referenced through aliases, so I guess that's why I see the regression in Solr response time.
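
The suspected mechanism can be modelled with a toy example. This is a sketch under stated assumptions, not the actual SolrJ implementation: the class `AliasLookupSketch`, the collection name `products_v2`, and the alias `products` are hypothetical. It only mirrors the behaviour described above, where a state cache keyed by collection name treats every alias lookup as a miss and forces a refresh.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class AliasLookupSketch {

    // Hypothetical cache keyed by collection name only, like the collectionStates map in the report.
    private final Map<String, String> collectionStates = new ConcurrentHashMap<>();
    private final AtomicInteger forcedRefreshes = new AtomicInteger();

    AliasLookupSketch() {
        // One real collection; the alias "products" is deliberately absent from the map.
        collectionStates.put("products_v2", "state-of-products_v2");
    }

    /** Models the reported 9.6.0 behaviour: a cache miss triggers a forced refresh. */
    String getState(String name) {
        String state = collectionStates.get(name);
        if (state == null) {
            forceUpdate(name);
            state = collectionStates.get(name);
        }
        return state;
    }

    // Synchronized to mirror the monitor seen in the thread dump.
    private synchronized void forceUpdate(String name) {
        forcedRefreshes.incrementAndGet();
        // A refresh cannot add an entry for an alias, so the next lookup by alias misses again.
    }

    int forcedRefreshCount() {
        return forcedRefreshes.get();
    }

    public static void main(String[] args) {
        AliasLookupSketch sketch = new AliasLookupSketch();
        for (int i = 0; i < 100; i++) {
            sketch.getState("products"); // lookup by alias
        }
        sketch.getState("products_v2"); // lookup by collection name hits the cache
        System.out.println("forced refreshes: " + sketch.forcedRefreshCount());
    }
}
```

In this model each lookup by alias costs one forced refresh, while lookups by collection name cost none, which matches the per-request forceUpdateCollection calls seen in the debugger.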

      I am not familiar enough with the code to prepare a PR, but I hope this insight will be enough to fix the issue.

      Attachments

        1. image-2024-05-06-17-23-42-236.png
          36 kB
          Rafał Harabień

            People

              Assignee: Unassigned
              Reporter: Rafał Harabień (rafalh)
              Votes: 0
              Watchers: 5

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h