Thanks Shalin for looking into the patch and your review.
ForceLeaderTest.testReplicasInLIRNoLeader has a 5 second sleep, why? Isn't waitForRecoveriesToFinish() enough?
Fixed. This was a left over from some previous patch. I think I wanted to put the waitForRecoveriesToFinish(), but forgot to remove the 5 second sleep.
Similarly, ForceLeaderTest.testLeaderDown has a 15 second sleep for steady state to be reached? What is this steady state, is there a better way than waiting for an arbitrary amount of time? In general, Thread.sleep should be avoided as much as possible as a way to reach steady state.
In this case, waiting those 15 seconds results in one of the down replicas to become a leader (but stay down). This is the situation I'm using FORCELEADER to recover from. Instead of waiting 15 seconds, I've added some polling with wait to wake up earlier if needed, while increasing the timeout from 15s to 25s.
Can you please add some javadocs on the various test methods describing the scenario that they are test?
minor nit - can you use assertEquals when testing equality of state etc instead of assertTrue. The advantage with assertEquals is that it logs the mismatched values in the exception messages.
Used assertEquals() now.
In OverseerCollectionMessageHandler, lirPath can never be null. The lir path should probably be logged in debug rather than INFO.
Thanks for the pointer, I've removed the null check. I feel this should be INFO instead of DEBUG, so that if a user says I issued FORCELEADER but still nothing worked for him, his logs would help us understand if we ever had any LIR state which was cleared out. But, please feel free to remove it if this doesn't make sense.
minor nit - you can compare enums directly using == instead of .equals
Referring to the following, what is the thinking behind it? when can this happen? is there a test which specifically exercises this scenario? seems like this can interfere with the leader election if the leader election was taking some time?
I modified the comment text to make it more clear. This is for the situation when all replicas are (somehow, due to bug maybe?) down/recovering (but not in LIR), and there is no leader, even though many replicas are on live; I don't know if this ever happens (the LIR case happens, I know). The testAllReplicasDownNoLeader test exercises this scenario. This is more or less the scenario that you described (with one difference that there is no leader as well): Leader is not live: Replicas are live but 'down' or 'recovering' -> mark them 'active'.
As you point out, I think it can indeed interfere with any on-going leader election; my thought was that this FORCELEADER call is issued only because the leader election isn't achieving a stable leader, so force marking the queue head replica as leader is okay. But I defer to your judgement if this is fine or not, and I can remove (or you feel free to remove) that code path from the patch if you feel it is not right.