This was discovered while auditing Jenkins failures from TestCollectionsAPIViaSolrCloudCluster.testCollectionCreateSearchDelete (where the test explicitly deletes and then recreates a collection with the same name), but as noted in a comment below, SOLR-11392 is another example of the non-obvious test failures that can pop up because of this bug.
In practice, it can affect any CloudSolrClient user after changes have been made to a collection (adding or moving replicas, etc.).
Original jira notes...
The test seems to fail with non-trivial frequency, so I grabbed the logs from a recent failure and started trying to follow along with the actions to figure out exactly what is happening....
The crux of this failure appears to be a genuine bug in how CloudSolrClient uses its cached ClusterState info when doing (direct) updates. The key bits seem to be:
- CloudSolrClient does something (update, query, etc.) with a collection, causing the current cluster state for the collection to be cached
- The actual collection changes such that a Solr node/core no longer exists as part of the collection
- CloudSolrClient is asked to process an UpdateRequest which triggers the code paths for the directUpdate() method – which attempts to route the updates directly to a replica of the appropriate shard using the (cached) collection state info
- CloudSolrClient (may) attempt to send that UpdateRequest to a node/core that doesn't exist, getting a 404 – which does not (seem to) trigger a state refresh, or retry to find a correct URL to resend the update to.
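To make the sequence above concrete, here is a toy model of it in plain Java. This is only a sketch under stated assumptions: ToyCluster, ToyClient, the refreshOn404 flag, and the core URLs are all hypothetical names invented for illustration, and none of this is SolrJ code or a claim about how CloudSolrClient is actually implemented. It only mimics "cache the state, change the collection, route an update from the stale cache".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Toy model only: none of these types are SolrJ classes. */
class StaleStateDemo {

    /** Stands in for the real cluster: collection name -> URL of the core currently hosting it. */
    static class ToyCluster {
        final Map<String, String> liveCoreUrl = new HashMap<>();

        /** Returns an HTTP-like status: 200 if some core lives at that URL, 404 otherwise. */
        int sendUpdate(String url) {
            return liveCoreUrl.containsValue(url) ? 200 : 404;
        }
    }

    /** Stands in for a client that caches collection state and routes updates directly. */
    static class ToyClient {
        final ToyCluster cluster;
        final Map<String, String> cachedCoreUrl = new HashMap<>();
        final boolean refreshOn404; // hypothetical switch, used only to contrast the two behaviors

        ToyClient(ToyCluster cluster, boolean refreshOn404) {
            this.cluster = cluster;
            this.refreshOn404 = refreshOn404;
        }

        /** Routes an update using the cached URL; optionally refreshes the cache and retries once on 404. */
        int directUpdate(String collection) {
            String url = cachedCoreUrl.computeIfAbsent(collection, cluster.liveCoreUrl::get);
            int status = cluster.sendUpdate(url);
            if (status == 404 && refreshOn404) {
                cachedCoreUrl.remove(collection); // drop the stale entry
                url = cachedCoreUrl.computeIfAbsent(collection, cluster.liveCoreUrl::get);
                status = cluster.sendUpdate(url);
            }
            return status;
        }
    }

    /** Runs the scenario and returns the status of the update before and after the collection changes. */
    static int[] run(boolean refreshOn404) {
        ToyCluster cluster = new ToyCluster();
        cluster.liveCoreUrl.put("c1", "http://node1/solr/c1_core_a"); // made-up URL
        ToyClient client = new ToyClient(cluster, refreshOn404);

        int before = client.directUpdate("c1"); // first use caches the collection state

        // The collection is deleted and recreated with the same name; the new core lives elsewhere.
        cluster.liveCoreUrl.put("c1", "http://node2/solr/c1_core_b"); // made-up URL

        int after = client.directUpdate("c1"); // routed from the (possibly stale) cache
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(false))); // prints [200, 404]
        System.out.println(Arrays.toString(run(true)));  // prints [200, 200]
    }
}
```

With refreshOn404 set to false the toy client keeps hitting the core that no longer exists, which is the behavior described in the last bullet; setting it to true shows what a refresh-and-retry on 404 would look like in this model.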
Details to follow in comment....