Thanks to pre-elections (KUDU-1365), slow UpdateConsensus calls on a single follower no longer disturb the whole tablet by triggering elections. However, I sometimes see situations where one or more followers constantly call pre-elections while only rarely, if ever, overflowing their service queues. Occasionally, in 3x-replicated tablets, the followers get "lucky" and detect a leader failure at around the same time, and an election happens.
This background instability makes bugs like KUDU-2343, which should be rare, occur fairly frequently. The extra RequestConsensusVote RPCs also add more stress on the consensus service and on replicas' consensus locks. And it spams the logs, since there's generally no exponential backoff applied to these pre-elections: a successful heartbeat lands in between them, which resets any backoff.
It seems like we can get into a situation where the average number of in-flight consensus requests is constant over time, meaning that on average each heartbeat is processed in less than the heartbeat interval, but some individual heartbeats take longer. Since UpdateConsensus calls to a replica are serialized, a few slow calls in a row trigger the failure detector, even though the follower receives every heartbeat in a timely manner and eventually responds successfully (and, on average, responds in a timely manner).
It'd be nice to prevent these worthless pre-elections. A few ideas:
1. Separately calculate a backoff for failed pre-elections, and reset it when a pre-election succeeds or more generally when there's an election.
2. Don't count the time the follower is executing UpdateConsensus against the failure detector. Mike Percy suggested stopping the failure detector during UpdateReplica() and resuming it when the function returns.
3. Move leader failure detection out-of-band of UpdateConsensus entirely.