Yeah, I started to take a whack at it at one point, basically taking control of the ordering of the election queue but abandoned it due to time constraints. One problem is that we're bastardizing the whole ephemeral election process in ZK and resorting to the "tie breaker" code that does things like "find the next guy and jump down two, unless you're within the first two of the head in which case do nothing". And the sorting is sensitive to the session ID to boot.
The TestRebalanceLeaders code exercises the shard leader election; we can see if we can extend it. I'm not sure how robust it is when nodes are flaky.
You mentioned at one point that you wondered whether the whole "watch the guy in front" and ZK's ephemeral-sequential node was the right way to approach this. The hack I started still used that mechanism and just took better control of how nodes were inserted into the leader election queue, so I don't think that approach really addresses why this has spun out of control.
I really wonder if we should change the mechanism. It seems to me that the fundamental fragility (apart from how hard the code is to understand) is that if the sequence of who watches which ephemeral node somehow gets out of whack, there is no mechanism for letting the other nodes in the queue know that there's a problem that needs to be sorted out. I assume that can result in no leader at all. Certainly happened often enough to me.
I wonder if tying leader election into ZK state changes rather than watching the ephemeral election node-in-front is a better way?
This has not been thought out, but what about something like:
Solr gets a notification of state change from ZK and drops into the "should I be leader" code which gets significantly less complex.
-1> If I'm not active, ??? Probably just return assuming the next state change will re-trigger this code.
0> If I'm not in the election queue, put myself at the tail. (handles mysterious out-of-whack situations)
1> If there is a leader and it's active, return. (if it's in the middle of going down, we should get another state change when it's down, right?)
2a> If some other node is both active and the preferred leader, return. (again depending on a state change message to get back to this code if that node goes down)
2b> If I'm the preferred leader, take over leadership.
3> If any other node in the leader election queue in front of me is active, return (state change gets us back here if those nodes are going down).
4> Take over leadership.
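To make the flow concrete, here's a rough Java sketch of steps -1> through 4>. Everything in it is made up for illustration (the Replica class, shouldTakeLeadership, the allReplicas and queue parameters); none of it is existing Solr or ZK API, and it assumes the caller has already read a consistent snapshot of the state.

```java
import java.util.Collection;
import java.util.List;

public class LeaderDecision {

    /** Hypothetical snapshot of one replica's state; not a Solr class. */
    public static final class Replica {
        final String name;
        final boolean active;
        final boolean preferredLeader;

        public Replica(String name, boolean active, boolean preferredLeader) {
            this.name = name;
            this.active = active;
            this.preferredLeader = preferredLeader;
        }
    }

    /**
     * Runs on every state-change notification. Returns true if "me" should
     * take over leadership. "leader" is the current leader or null if none.
     * "queue" is the election queue, head first; it is modified in step 0.
     */
    public static boolean shouldTakeLeadership(Replica me, Replica leader,
                                               Collection<Replica> allReplicas,
                                               List<Replica> queue) {
        // -1> Not active: return and wait for the next state change.
        if (!me.active) {
            return false;
        }
        // 0> Not in the election queue: put myself at the tail.
        if (!queue.contains(me)) {
            queue.add(me);
        }
        // 1> There is a leader and it's active.
        if (leader != null && leader.active) {
            return false;
        }
        // 2a> Some other node is both active and the preferred leader.
        for (Replica r : allReplicas) {
            if (r != me && r.active && r.preferredLeader) {
                return false;
            }
        }
        // 2b> I'm the preferred leader.
        if (me.preferredLeader) {
            return true;
        }
        // 3> Any node in front of me in the queue is active.
        for (Replica r : queue) {
            if (r == me) {
                break;
            }
            if (r.active) {
                return false;
            }
        }
        // 4> Nobody ahead of me can lead.
        return true;
    }
}
```

The sketch only decides; actually claiming leadership, and guarding against two nodes reaching step 4> from different snapshots, is left out.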
Since this operates off of state changes to ZK, it seems like it gives us the chance to recover from weird situations. I don't think it increases traffic; don't all ZK state changes have to go to all nodes anyway?
I'm not sure in this case whether we even need a leader election queue at all. Is the clusterstate any less robust than the election queue? Even if it would be just as good, not sure how you'd express "the node in front". Actually, a simple counter property in the state for each replica would do it maybe. You'd set it at one more than any other node in the collection when a node changed its state to "active". I'll freely admit though, you've seen a lot more in the weeds here than I have so I'll defer to your experience.
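For what it's worth, the counter idea could look something like the Java sketch below. The names are all hypothetical (ElectionCounter, nextCounter, nodeInFront, and the map of replica name to counter value standing in for a per-replica property in the state). It also ignores what happens if two replicas go active at the same time and pick the same value.

```java
import java.util.Map;
import java.util.Optional;

public class ElectionCounter {

    /** Value to assign when a replica goes active: one more than any existing counter. */
    public static int nextCounter(Map<String, Integer> counters) {
        int max = 0;
        for (int v : counters.values()) {
            max = Math.max(max, v);
        }
        return max + 1;
    }

    /** "The node in front": the replica with the largest counter below mine, if any. */
    public static Optional<String> nodeInFront(Map<String, Integer> counters, String me) {
        Integer mine = counters.get(me);
        if (mine == null) {
            return Optional.empty();
        }
        String best = null;
        int bestVal = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : counters.entrySet()) {
            int v = e.getValue();
            if (v < mine && v > bestVal) {
                best = e.getKey();
                bestVal = v;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

With that, step 3> becomes "is any active replica's counter lower than mine", with no separate queue to keep in sync.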
Anyway, let's kick the tires of what's to be done, maybe we can tag-team this. I consider the above just a jumping-off point to tame this beast. Be glad to chat if you or anyone else wants to kick it around...
One thing I'm not real clear on is how up-to-date the ZK cluster state is. Since changing the state is done through the Overseer, how do we ensure that the state is current when making decisions?