We are periodically seeing a tricky concurrency bug in SolrCloud that starts with a `Could not fully remove collection: my_collection` exception:
After that, all SolrCloud operations that involve reading cluster state fail with
See full stacktraces
As a result, SolrCloud becomes completely broken. We are seeing this with 6.5.1, but I think we've seen it with older versions too.
From looking into the code it looks like it is a combination of two factors:
- Forcefully removing the collection's znode in the finally block of DeleteCollectionCmd, which was introduced in SOLR-5135. Note that this leaves the cached cluster state out of sync with the state in Zk: zkStateReader.getClusterState() still contains the collection (see the code here), whereas the /collections/<collection_id> znode in Zk has already been removed.
- The read cluster state operation does not just return the cached version; it also reads each collection's config name from the /collections/<collection_id> znode, which in this case has already been forcefully removed. The code that reads the config name for every collection directly from Zk was introduced in SOLR-7636. Aren't there performance implications of reading N znodes (1 per collection) on every getClusterStatus call?
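To make the interaction of these two factors concrete, here is a minimal, self-contained sketch of the end state. It is not Solr code: the `HashMap` stands in for Zk, the `HashSet` stands in for the cached cluster state, and `NoNodeException`, `readConfigName`, and this `getClusterStatus` are hypothetical stand-ins named after the real ones. It models only the resulting inconsistency, not the actual thread interleaving.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class StaleClusterStateDemo {
    // Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException.
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    // Stand-in for Zk: znode path -> data (here, the config name).
    static final Map<String, String> ZK = new HashMap<>();
    // Stand-in for the cached cluster state: names of known collections.
    static final Set<String> CACHED_COLLECTIONS = new HashSet<>();

    // Reads the config name directly from the collection's znode.
    static String readConfigName(String collection) throws NoNodeException {
        String path = "/collections/" + collection;
        String data = ZK.get(path);
        if (data == null) throw new NoNodeException(path);
        return data;
    }

    // Iterates the cached collections but reads one znode per collection.
    static Map<String, String> getClusterStatus() throws NoNodeException {
        Map<String, String> status = new TreeMap<>();
        for (String c : CACHED_COLLECTIONS) {
            status.put(c, readConfigName(c));
        }
        return status;
    }

    // Returns true if the status read fails after the forced znode removal.
    static boolean reproduce() {
        ZK.clear();
        CACHED_COLLECTIONS.clear();
        ZK.put("/collections/my_collection", "my_config");
        CACHED_COLLECTIONS.add("my_collection");

        // Failed delete: the znode is removed anyway, the cache is not updated.
        ZK.remove("/collections/my_collection");

        try {
            getClusterStatus();
            return false;
        } catch (NoNodeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("cluster status read fails: " + reproduce());
    }
}
```

In this model a single stale entry in the cache is enough to fail the whole status call, which matches the symptom that every operation reading cluster state breaks.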
I'm not sure what the proper fix should be:
- Should we just catch KeeperException$NoNodeException in getClusterStatus and treat such a collection as removed? That looks like the easiest and least invasive fix.
- Should we stop reading config name from collection znode and get it from cache somehow?
- Should we not try to delete collection's data from Zk if delete operation failed?
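As an illustration of the first option only, here is a rough, self-contained sketch of a status read that skips collections whose znode is missing. It is not a patch against Solr: `NoNodeException` is a hypothetical stand-in for KeeperException$NoNodeException, the `Map` passed in stands in for Zk, and the method signatures are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TolerantClusterStatusSketch {
    // Hypothetical stand-in for ZooKeeper's KeeperException.NoNodeException.
    static class NoNodeException extends Exception {
        NoNodeException(String path) { super("NoNode for " + path); }
    }

    // Reads the config name from the (simulated) collection znode.
    static String readConfigName(Map<String, String> zk, String collection)
            throws NoNodeException {
        String path = "/collections/" + collection;
        String data = zk.get(path);
        if (data == null) throw new NoNodeException(path);
        return data;
    }

    // Builds collection -> config name, treating a missing znode as
    // "collection already removed" instead of failing the whole call.
    static Map<String, String> getClusterStatus(Set<String> cachedCollections,
                                                Map<String, String> zk) {
        Map<String, String> status = new TreeMap<>();
        for (String c : cachedCollections) {
            try {
                status.put(c, readConfigName(zk, c));
            } catch (NoNodeException e) {
                // Stale cache entry: skip it and keep serving the rest.
            }
        }
        return status;
    }

    public static void main(String[] args) {
        Map<String, String> zk = new HashMap<>();
        zk.put("/collections/healthy", "conf1");
        // "my_collection" is still cached, but its znode is gone.
        Set<String> cached = new HashSet<>(Arrays.asList("healthy", "my_collection"));
        System.out.println(getClusterStatus(cached, zk));
    }
}
```

The trade-off in this sketch is that the stale cache entry is hidden rather than repaired, so the other two options would still be worth considering for the underlying inconsistency.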