Description
This was discovered while auditing Jenkins failures from TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete (where a test explicitly deletes and then recreates a collection with the same name), but as noted in a comment below, SOLR-11392 is another example of the non-obvious test failures that can pop up because of this bug.
In practice, it can affect any CloudSolrClient user after changes have been made to a collection (to add or move replicas, etc.).
Original Jira notes:
TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete seems to fail with non-trivial frequency, so I grabbed the logs from a recent failure and started trying to follow along with the actions to figure out what exactly is happening:
https://jenkins.thetaphi.de/job/Lucene-Solr-master-Linux/20662/
[junit4] ERROR 20.3s J1 | TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete <<<
[junit4] > Throwable #1: org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at https://127.0.0.1:42959/solr/testcollection_shard1_replica_n3: Expected mime type application/octet-stream but got text/html. <html>
[junit4] > <head>
[junit4] > <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"/>
[junit4] > <title>Error 404 </title>
The crux of this failure appears to be a genuine bug in how CloudSolrClient uses its cached ClusterState info when doing (direct) updates. The key bits seem to be:
- CloudSolrClient does something (update, query, etc.) with a collection, causing the current cluster state for the collection to be cached
- The actual collection changes such that a Solr node/core no longer exists as part of the collection
- CloudSolrClient is asked to process an UpdateRequest which triggers the code paths for the directUpdate() method – which attempts to route the updates directly to a replica of the appropriate shard using the (cached) collection state info
- CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't exist, getting a 404 – which does not (seem to) trigger a state refresh, or retry to find a correct URL to resend the update to.
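The sequence above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not CloudSolrClient's actual code: the class `StaleStateRetrySketch`, its methods, the in-memory maps standing in for ZooKeeper and the client cache, and the example URLs are all hypothetical. It shows a client routing with a cached replica URL, the collection being recreated on a different core, and the effect of either failing on the resulting 404 or refreshing the cached state and retrying once.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation; none of these names exist in SolrJ.
final class StaleStateRetrySketch {

    // Stand-in for the authoritative cluster state: collection -> replica URL.
    private final Map<String, String> authoritativeState = new HashMap<>();
    // Stand-in for the client-side cached collection state.
    private final Map<String, String> cachedState = new HashMap<>();
    // Core URLs that currently exist on some node.
    private final Set<String> liveCores = new HashSet<>();

    // Simulated HTTP call: 200 if the core exists, 404 otherwise.
    private int send(String url) {
        return liveCores.contains(url) ? 200 : 404;
    }

    // Routes an update using the cached state; optionally refreshes and retries on 404.
    int directUpdate(String collection, boolean retryOn404) {
        String url = cachedState.computeIfAbsent(collection, authoritativeState::get);
        int status = send(url);
        if (status == 404 && retryOn404) {
            // Assumed remedy: treat 404 as a sign of stale state, refresh, retry once.
            String fresh = authoritativeState.get(collection);
            cachedState.put(collection, fresh);
            status = send(fresh);
        }
        return status;
    }

    // Runs the scenario and returns {status before the change, status after it}.
    static int[] simulate(boolean retryOn404) {
        StaleStateRetrySketch client = new StaleStateRetrySketch();
        String oldCore = "http://127.0.0.1:8983/solr/testcollection_shard1_replica_n1";
        String newCore = "http://127.0.0.1:8983/solr/testcollection_shard1_replica_n3";

        // Step 1: the collection exists and the client caches its state.
        client.authoritativeState.put("testcollection", oldCore);
        client.liveCores.add(oldCore);
        int before = client.directUpdate("testcollection", retryOn404);

        // Step 2: the collection is deleted and recreated on a different core.
        client.liveCores.remove(oldCore);
        client.authoritativeState.put("testcollection", newCore);
        client.liveCores.add(newCore);

        // Steps 3 and 4: the next update is routed with the stale cached URL.
        int after = client.directUpdate("testcollection", retryOn404);
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        System.out.println("without retry: " + simulate(false)[1]);
        System.out.println("with retry:    " + simulate(true)[1]);
    }
}
```

With `retryOn404` set to false the second update ends in a 404, matching the failure described; with it set to true the stale entry is replaced and the retry succeeds. Whether a refresh-and-retry on 404 is the right remedy for the real client is an assumption of this sketch.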
Details to follow in a comment.
Attachments
Issue Links
- breaks SOLR-11392 StreamExpressionTest.testParallelExecutorStream fails too frequently (Open)
- links to