I was investigating some failures in TestCloudSearcherWarming.testRepFactor1LeaderStartup which lead me to the hunch that waitForState wasn't ensuring that the predicates registered would always be called if/when a node was shutdown.
Digging into it a bit more, I found that the root cause seems to be the way the CollectionStateWatcher / CollectionStatePredicate APIs pass in both the DocCollection, and the "current" liveNodes - but are only triggered by the StateWatcher on the state.json (which is used to rebuild the DocCollection) - when the CollectionStateWatcher / CollectionStatePredicate are called, they get the "fresh" DocCollection but they get the cached ZkStateReader.liveNodes
Meanwhile, the LiveNodeWatcher only calls refreshLiveNodes() only updates ZkStateReader.liveNodes and triggers any LiveNodesListener - it does NOT invoke any CollectionStateWatcher that may have replicas hosted on any of changed nodes.
Since there is no garunteed order that Watchers will be triggered, this means there is a race condition where the following can happen...
- client1 has a ZkStateReader with cached liveNodes=[N1, N2, N3]
- client1 registers a CollectionStateWatcher "watcherZ" that cares if "replicaX" of collectionA is on a "down" node
- client2 causes shutdown of node N1 which is hosting replicaX
- client1's zkStateReader gets a WatchedEvent for state.json of collectionA
- DocCollection for collectionA is rebuilt
- watcherZ is fired w/cached liveNodes=[N1, N2, N3] and the new DocCollection
- watcherZ sees that replicaX is on N1, but thinks N1 is live
- watcherZ says "everything ok, not the event i was waiting for" and doesn't take any action
- client1's zkStateReader gets a WatchedEvent for LIVE_NODES_ZKNODE
- zkStateReader.liveNodes is rebuilt
...at no point in this sequence (or after this) will watcherZ be notified fired with the updated liveNodes (unless/until another state.json change is made for collectionA.
While this is definitely be problematic in tests that deal with node lifecyle and use things like SolrCloudTestCase.waitForState(..., SolrCloudTestCase.clusterShape(...)) to check for the expected shards/replicas, a cursory search of how/where ZkStateReader.waitForState(...) and ZkStateReader.registerCollectionStateWatcher(...) are used in solr-core suggests that could also lead to bad behavior in situations like reacting to shard leader loss, waiting for all leaders of SYSTEM_COLL to come online for upgrade, running PrepRecoveryOp, etc... (anywhere that liveNodes is used by the watcher/predicate)