Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Fixed
- Affects Version/s: 8.8, main (10.0)
- Fix Version/s: None
Description
We have observed some odd behavior in our Solr deployment: the cluster state for certain collections appears to be "stuck" in a stale state. While inspecting the current code logic, we found that a race condition can arise and leave `collectionWatches` and `watchedCollectionStates` in inconsistent states.
Looking at the current design, the 2 fields are (a simplified sketch follows the list):
- collectionWatches - for "watching" ZK updates on collections (via notification). The map's key is the collection name (e.g. org) and its value is a CollectionWatch, which in turn holds a set of DocCollectionWatchers (i.e. ConcurrentHashMap<String, CollectionWatch<DocCollectionWatcher>> collectionWatches)
- watchedCollectionStates - a ConcurrentHashMap<String, DocCollection>, whose key is the collection name and whose value is the DocCollection produced by the previous watch event handled by ZkStateReader$StateWatcher (either the fetch on first watcher registration or a notification from ZK on state changes)
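For reference, here is a simplified sketch of how these two fields relate. The field names follow the description above; the CollectionWatch internals shown are abbreviated stand-ins and may differ from the actual Solr source:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch, not the actual Solr source: the two maps that are expected to stay in sync.
public class ZkStateReaderSketch {

  // Stand-ins for the real org.apache.solr.common.cloud types.
  interface DocCollection {}
  interface DocCollectionWatcher {}

  // Holds the watchers registered for one collection (abbreviated).
  static class CollectionWatch<T> {
    final Set<T> stateWatchers = ConcurrentHashMap.newKeySet();

    boolean canBeRemoved() {
      return stateWatchers.isEmpty();
    }
  }

  // Key: collection name. Value: who is watching it.
  private final ConcurrentHashMap<String, CollectionWatch<DocCollectionWatcher>> collectionWatches =
      new ConcurrentHashMap<>();

  // Key: collection name. Value: the DocCollection produced by the last watch event for it.
  private final ConcurrentHashMap<String, DocCollection> watchedCollectionStates =
      new ConcurrentHashMap<>();
}
```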
At the design level, these 2 fields should stay "in sync": if a collection is no longer tracked in collectionWatches, then it should not have an entry in watchedCollectionStates either.
However, this guarantee is not a strong one: there are various pieces of code that try to remove entries from watchedCollectionStates when the collection is somehow no longer in collectionWatches, for example here; in particular, this appears to address a race condition with unregisterCore. The code that removes a registered DocCollectionWatcher also tries to ensure both maps stay in sync, as seen here.
The core of the issue appears to be the assumption that when the last DocCollectionWatcher is removed from the CollectionWatch, both watchedCollectionStates and collectionWatches are purged of the watched collection. The clusterState should then hold a LazyCollectionRef instead, whose DocCollection get(boolean allowCached) invocation returns the correct DocCollection state when allowCached=false. Unfortunately, race conditions remain possible as long as there are 2 separate maps. One possible race condition is demonstrated in a later section.
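To make that kind of interleaving concrete, here is a minimal, self-contained model of the race. It is hypothetical and heavily simplified: the map values and the comments only mimic updateWatchedCollection and removeDocCollectionWatcher, they are not the real code, and the two "threads" are replayed sequentially in the order described later in the reproduction steps:

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical replay of the interleaving: thread A is a zkCallback notification,
// thread B removes the last DocCollectionWatcher for the same collection.
public class TwoMapRaceSketch {
  static final ConcurrentHashMap<String, Object> collectionWatches = new ConcurrentHashMap<>();
  static final ConcurrentHashMap<String, String> watchedCollectionStates = new ConcurrentHashMap<>();

  public static void main(String[] args) {
    String coll = "test";
    collectionWatches.put(coll, new Object());
    watchedCollectionStates.put(coll, "state-v1");

    // B: last DocCollectionWatcher removed -> drop the cached state first...
    watchedCollectionStates.remove(coll);

    // A: the zkCallback thread resumes, sees no cached state, so it inserts one.
    if (watchedCollectionStates.get(coll) == null) {
      watchedCollectionStates.putIfAbsent(coll, "state-v2");
    }

    // B: ...then purge the watch entry.
    collectionWatches.remove(coll);

    // Invariant broken: the collection is no longer watched, but a state stays cached
    // (and clusterState keeps a non-lazy CollectionRef that will never be refreshed).
    System.out.println("collectionWatches contains coll?       " + collectionWatches.containsKey(coll));
    System.out.println("watchedCollectionStates contains coll? " + watchedCollectionStates.containsKey(coll));
  }
}
```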
PR proposal
One idea is to merge the watched DocCollection state (watchedCollectionStates) into CollectionWatch, so that we cannot run into a race condition where a collection key appears in one map but not the other. Please see the PR here: https://github.com/apache/solr/pull/909 - we would love to gather some feedback and thoughts!
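As a rough sketch of that direction (simplified and hypothetical; the PR above is the authoritative version), the cached DocCollection would live inside the watch entry itself, so membership and state are kept in a single ConcurrentHashMap and can be updated atomically via compute/computeIfPresent:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the merged structure; DocCollection and DocCollectionWatcher
// stand in for the real Solr types, and the actual change is in apache/solr PR #909.
final class MergedWatchSketch {

  interface DocCollection {}
  interface DocCollectionWatcher {}

  // A watch entry that also carries the last known state for the collection,
  // replacing the separate watchedCollectionStates map.
  static final class StatefulCollectionWatch {
    final Set<DocCollectionWatcher> stateWatchers = ConcurrentHashMap.newKeySet();
    volatile DocCollection currentState;
  }

  private final ConcurrentHashMap<String, StatefulCollectionWatch> collectionWatches =
      new ConcurrentHashMap<>();

  // "Is this collection watched?" and "what is its cached state?" are now answered by a
  // single map entry, so the two answers cannot disagree.
  void updateWatchedCollection(String coll, DocCollection newState) {
    collectionWatches.computeIfPresent(coll, (k, watch) -> {
      watch.currentState = newState; // state only updated while the watch entry still exists
      return watch;
    });
  }

  void removeDocCollectionWatcher(String coll, DocCollectionWatcher watcher) {
    collectionWatches.compute(coll, (k, watch) -> {
      if (watch == null) return null;
      watch.stateWatchers.remove(watcher);
      // Removing the entry drops the cached state at the same time: no second map to forget.
      return watch.stateWatchers.isEmpty() ? null : watch;
    });
  }
}
```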
Steps to reproduce a race condition
Spin up 2 nodes; only one node should serve the test collection, and the other node should be made the overseer/leader of the cluster. Debug on the overseer node:
- From the IDE, ensure all breakpoints suspend the "thread" only, not "all" (IntelliJ terminology)
- All breakpoints are in ZkStateReader. Set a breakpoint at the line here and add the condition Thread.currentThread().getName().startsWith("zkCallback"). This pauses when a ZK notification comes in, right before updateWatchedCollection checks whether the collection is already in watchedCollectionStates
- Set a breakpoint here, at the line that throws TimeoutException after the latch times out, for example from a call inside ZkStateReader#waitForState
- Now start debugging the overseer node
- Issue a split shard request to the overseer. For example http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=test&shard=shard1
- Eventually a thread will stop at the first breakpoint within updateWatchedCollection; the thread name should be something like zkCallback-127-thread-2
- You might have to wait up to 320 secs (the timeout in CollectionHandlingUtils.waitForNewShard); then another thread should stop at the 2nd breakpoint that throws TimeoutException, with a thread name similar to OverseerThreadFactory...
- Add a breakpoint here in removeDocCollectionWatcher, at the line that removes the entry from watchedCollectionStates but has not yet purged collectionWatches
- Resume the OverseerThreadFactory thread that is about to throw TimeoutException; it should stop at the new breakpoint in removeDocCollectionWatcher
- Switch back to the zkCallback... thread and resume it; it should find that watchedCollectionStates is empty and therefore insert a CollectionRef into it
- Switch back to the OverseerThreadFactory thread and step a few times; it will purge collectionWatches, but watchedCollectionStates will still have one entry in it
- If we inspect zkStateReader.clusterState at this point, we will notice that it holds a non-lazy CollectionRef for the test collection, while collectionWatches is empty but watchedCollectionStates still contains the collection state
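At that last step, the broken invariant can be checked with something like the following snippet. This is hypothetical and intended for the debugger's "evaluate expression" window: both maps are private fields of ZkStateReader, so it cannot be called from user code as-is:

```java
// Hypothetical debugger "evaluate expression" snippet: list collections whose state is still
// cached even though the collection is no longer watched.
for (String coll : watchedCollectionStates.keySet()) {
  if (!collectionWatches.containsKey(coll)) {
    System.out.println(coll + " is still in watchedCollectionStates but not in collectionWatches");
  }
}
```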