Description
After upgrading from SolrJ 9.5.0 to 9.6.0, I observe worse CloudSolrClient latency, especially at p99:
p99 jumped from ~25 ms to ~400 ms
p90 jumped from ~9.9 ms to ~22 ms
p75 jumped from ~7 ms to ~11 ms
p50 jumped from ~4.5 ms to ~7.5 ms
Screenshot from Grafana (the new version was deployed at ~14:30):
I took a thread dump and can see many threads blocked in ZkStateReader.forceUpdateCollection:
Thread info: "suggest-solrThreadPool-thread-52" prio=5 Id=600 BLOCKED on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d owned by "suggest-solrThreadPool-thread-34" Id=582
    at app//org.apache.solr.common.cloud.ZkStateReader.forceUpdateCollection(ZkStateReader.java:506)
    - blocked on org.apache.solr.common.cloud.ZkStateReader@62e6bc3d
    at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.getState(ZkClientClusterStateProvider.java:155)
    at app//org.apache.solr.client.solrj.impl.CloudSolrClient.resolveAliases(CloudSolrClient.java:1207)
    at app//org.apache.solr.client.solrj.impl.CloudSolrClient.sendRequest(CloudSolrClient.java:1099)
    at app//org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:892)
    at app//org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:820)
    at app//org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:255)
    at app//org.apache.solr.client.solrj.SolrClient.query(SolrClient.java:927)
    ...
    Number of locked synchronizers = 1
    - java.util.concurrent.ThreadPoolExecutor$Worker@1beb7ed3
At the same time, the QTime reported by Solr hasn't changed, so I'm pretty sure it's a client-side regression.
I've tried reproducing it locally, and I can see forceUpdateCollection being called for every request in my application. I can see that this commit changed the logic in ZkClientClusterStateProvider.getState so that forceUpdateCollection is called whenever clusterState.getCollectionRef returns null. In 9.5.0 this was not the case (forceUpdateCollection was not called at this point). In the debugger I can see that getCollectionRef only knows about collections, not aliases (the collectionStates map contains only collections). In my application all collections are referenced through aliases, so I believe that is why I see the regression in response time.
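To illustrate the suspected mechanism, here is a minimal toy model, not the actual SolrJ code. The class ToyStateProvider and the names "products" (alias) and "products_v2" (collection) are hypothetical; only the method names mirror the ones in the stack trace. It assumes, as observed in the debugger, that the state map holds collections only, so a lookup by alias always misses and always falls into the synchronized refresh:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class AliasLookupSketch {

    /** Toy stand-in for the cluster state provider; NOT the SolrJ class. */
    static class ToyStateProvider {
        // Holds entries for real collections only, never for aliases (assumption from the report).
        private final Map<String, String> collectionStates = new ConcurrentHashMap<>();
        private final AtomicInteger forcedRefreshes = new AtomicInteger();

        ToyStateProvider(Set<String> collections) {
            for (String c : collections) {
                collectionStates.put(c, "state-of-" + c);
            }
        }

        String getCollectionRef(String name) {
            return collectionStates.get(name);
        }

        // Synchronized like the method the threads are blocked on in the dump.
        // A real refresh would re-read state from ZooKeeper; here we only count calls,
        // because an alias would still have no entry afterwards.
        synchronized void forceUpdateCollection(String name) {
            forcedRefreshes.incrementAndGet();
        }

        // Sketch of the 9.6.0 behaviour as described above: a null ref triggers a forced refresh.
        String getState(String name) {
            String ref = getCollectionRef(name);
            if (ref == null) {
                forceUpdateCollection(name);
                ref = getCollectionRef(name);
            }
            return ref;
        }

        int forcedRefreshes() {
            return forcedRefreshes.get();
        }
    }

    public static void main(String[] args) {
        ToyStateProvider provider = new ToyStateProvider(Set.of("products_v2"));

        for (int i = 0; i < 5; i++) {
            provider.getState("products_v2"); // real collection: map hit, no refresh
        }
        System.out.println("refreshes after collection lookups: " + provider.forcedRefreshes());

        for (int i = 0; i < 5; i++) {
            provider.getState("products"); // alias: map miss on every request
        }
        System.out.println("refreshes after alias lookups: " + provider.forcedRefreshes());
    }
}
```

In this model every alias lookup pays for one trip through the synchronized method, which would match concurrent request threads queueing on the ZkStateReader monitor as in the thread dump.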
I am not familiar enough with the code to prepare a PR, but I hope this insight will be enough to fix the issue.
Attachments
Issue Links
- is caused by SOLR-17153 CloudSolrClient should not throw "Collection not found" with an out-dated ClusterState (Closed)
- links to