Thanks to pre-elections (KUDU-1365), slow UpdateConsensus calls on a single follower no longer disturb the whole tablet by triggering elections. However, I sometimes see situations where one or more followers constantly call pre-elections while only rarely, if ever, overflowing their service queues. Occasionally, in 3x-replicated tablets, the followers get "lucky" and detect a leader failure at around the same time, and an election happens.
This background instability makes bugs like KUDU-2343, which should be rare, occur fairly frequently. The extra RequestConsensusVote RPCs also add more stress on the consensus service and on replicas' consensus locks. And it spams the logs, since there's generally no exponential backoff applied to these pre-elections: a successful heartbeat lands in between them, which resets any backoff.
It seems like we can get into a situation where the average number of in-flight consensus requests is constant over time, meaning that on average each heartbeat is processed in less than the heartbeat interval, but some individual heartbeats take longer. Since UpdateConsensus calls to a replica are serialized, a few slow calls in a row trigger the failure detector, even though the follower receives every heartbeat in a timely manner and eventually responds successfully (and, on average, responds in a timely manner).
It'd be nice to prevent these worthless pre-elections. A few ideas:
1. Separately calculate a backoff for failed pre-elections, and reset it when a pre-election succeeds or more generally when there's an election.
2. Don't count the time the follower is executing UpdateConsensus against the failure detector. Mike Percy suggested stopping the failure detector during UpdateReplica() and resuming it when the function returns.
3. Move leader failure detection out-of-band of UpdateConsensus entirely.