[KUDU-2149] New failure detector implementation can lead to election stacking - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.5.0
Fix Version/s: 1.6.0
Component/s: consensus
Labels: None

Code Review: https://gerrit.cloudera.org/#/c/8107

Description

A new failure detector (FD) implementation was merged in commit 21b0f3d and is part of Kudu 1.5. One of the key changes is that the detection logic now runs on a reactor thread rather than on a dedicated per-replica thread. Because reactor threads are shared, the election started when a failure is detected must be thunked to the Raft thread pool: starting an election means casting a vote, which generally means performing IO, which is verboten on a reactor thread.

Because the election is thunked, the FD rearms immediately; the previous implementation did not do this. If there's a lot of outstanding IO (e.g. during an election storm across thousands of tablets), the FD can fire again while the first election task is still waiting to cast its vote. The new election task will try to acquire the consensus lock and block on it, since the lock is held by the first election task. Each subsequent FD firing adds another blocked task in the same way. When the original IO finally completes, all of the follow-on elections are unblocked at the same time.
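
The stacking effect can be illustrated with a toy discrete-time model. This is only a sketch under simplifying assumptions, not Kudu code: the function name, the fixed (non-randomized) FD timeout, and the single slow IO are all hypothetical. It counts how many follow-on election tasks would be blocked on the consensus lock at the moment the first election's IO completes, with and without immediate rearming.

```python
def count_stacked_elections(fd_timeout, io_duration, rearm_immediately):
    """Toy model of election stacking (hypothetical, not Kudu's implementation).

    fd_timeout: ticks between FD firings (assumed fixed for simplicity).
    io_duration: ticks the first election holds the consensus lock doing IO.
    rearm_immediately: True models the FD rearming as soon as the election
        is handed off to another thread pool; False models an FD that stays
        disarmed while an election is in progress.

    Returns the number of election tasks blocked on the lock when the first
    election's IO finishes.
    """
    first_fire = fd_timeout              # FD fires for the first time
    io_done = first_fire + io_duration   # first election releases the lock

    if not rearm_immediately:
        # The FD cannot fire again until the first election finishes.
        return 0

    blocked = 0
    next_fire = first_fire + fd_timeout
    while next_fire < io_done:
        # Each firing submits a new election task that blocks on the lock.
        blocked += 1
        next_fire += fd_timeout
    return blocked


if __name__ == "__main__":
    print(count_stacked_elections(3, 10, rearm_immediately=True))
    print(count_stacked_elections(3, 10, rearm_immediately=False))
```

In this model the number of stacked tasks grows with the ratio of IO latency to FD timeout, which matches the observation that the problem appears when there is a lot of outstanding IO.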

Issue Links

is related to: KUDU-2155 Disarm failure detector during an election (Resolved)

People

Assignee: Adar Dembo

Reporter: Adar Dembo

Votes: 0

Watchers: 3

Dates

Created: 19/Sep/17 23:11

Updated: 04/Feb/20 23:20

Resolved: 21/Sep/17 01:52