Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Description
With the introduction of the autoscaling framework, we have seen an increase in the number of issues caused by race conditions between deleting a replica and other operations.
Case 1: DeleteReplicaCmd fails to send an UNLOAD request to a replica and therefore forcefully removes its entry from clusterstate, but the replica still functions normally and is able to become a leader -> SOLR-12176
Case 2:
- DeleteReplicaCmd enqueues a DELETECOREOP (without sending a request to the replica, because the node is not live)
- The node starts and the replica gets loaded
- DELETECOREOP has not been processed yet, so the replica is still present in clusterstate --> it passes checkStateInZk
- DELETECOREOP is executed and DeleteReplicaCmd finishes
- result 1: the replica starts recovering, finishes, and publishes itself as ACTIVE --> state of the replica is ACTIVE
- result 2: the replica throws an exception (probably an NPE) --> state of the replica is DOWN and it does not join leader election
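The ordering in Case 2 (result 1) can be replayed with a small toy model. This is only a sketch in Python, not Solr code: `ToyCluster`, `run_case_2`, the replica name `core_node1`, and the node name `node1` are hypothetical, and the model assumes that a published state update re-creates the replica's entry in clusterstate.

```python
from collections import deque


class ToyCluster:
    """Toy stand-in for clusterstate plus a queue of pending state updates."""

    def __init__(self):
        self.clusterstate = {}    # replica name -> state string
        self.queue = deque()      # pending (op, replica, state) tuples
        self.live_nodes = set()

    def delete_replica(self, replica, node):
        # The node is not live, so no UNLOAD request is sent;
        # only the removal is enqueued.
        if node not in self.live_nodes:
            self.queue.append(("DELETECOREOP", replica, None))

    def start_node(self, node, replica):
        # Stand-in for the checkStateInZk step: the core is loaded
        # only if its entry is still present in clusterstate.
        self.live_nodes.add(node)
        return replica in self.clusterstate

    def publish(self, replica, state):
        self.queue.append(("STATE", replica, state))

    def process_queue(self):
        while self.queue:
            op, replica, state = self.queue.popleft()
            if op == "DELETECOREOP":
                self.clusterstate.pop(replica, None)
            else:
                # Assumption: a state update re-creates a missing entry.
                self.clusterstate[replica] = state


def run_case_2():
    cluster = ToyCluster()
    cluster.clusterstate["core_node1"] = "DOWN"

    cluster.delete_replica("core_node1", "node1")        # enqueue only
    loaded = cluster.start_node("node1", "core_node1")   # check passes
    cluster.process_queue()                              # DELETECOREOP runs
    removed = "core_node1" not in cluster.clusterstate

    if loaded:
        cluster.publish("core_node1", "ACTIVE")          # result 1
        cluster.process_queue()
    return loaded, removed, cluster.clusterstate.get("core_node1")
```

In this model the delete completes and removes the entry, yet the replica ends up listed as ACTIVE again, because the startup check ran before the queued DELETECOREOP was processed.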