Description
UpdateConsensus() RPC calls often time out when disks are slow. We need to investigate this further to understand the dynamics, then find a solution.
When the local disks on a Kudu cluster are overloaded, the RaftConsensus metadata fsyncs triggered by Raft votes and term changes take longer, which in turn keeps the RaftConsensus lock held for longer. UpdateConsensus() RPCs then "stack" up waiting for the lock, resulting in timeouts.
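To make the "stacking" effect concrete, here is a minimal sketch of a toy model, not Kudu code. It assumes a single lock that serializes all requests, a fixed lock hold time standing in for the fsync, and evenly spaced RPC arrivals. The function name `simulate_stacking` and all numbers are hypothetical and chosen only for illustration; they are not measurements from a Kudu cluster.

```python
def simulate_stacking(num_rpcs, arrival_interval_ms, lock_hold_ms, rpc_timeout_ms):
    """Toy model: RPCs arrive at a fixed interval and are served one at a
    time under a single lock held for lock_hold_ms (a stand-in for a slow
    fsync). Returns a list of (latency_ms, timed_out) tuples, one per RPC."""
    lock_free_at = 0  # time at which the lock next becomes available
    results = []
    for i in range(num_rpcs):
        arrival = i * arrival_interval_ms
        start = max(arrival, lock_free_at)  # wait if the lock is still held
        finish = start + lock_hold_ms
        lock_free_at = finish
        latency = finish - arrival
        results.append((latency, latency > rpc_timeout_ms))
    return results


# Hypothetical numbers: 10 RPCs arriving every 100 ms with a 1000 ms timeout.
fast_disk = simulate_stacking(10, 100, 5, 1000)    # lock held 5 ms per request
slow_disk = simulate_stacking(10, 100, 400, 1000)  # lock held 400 ms per request

print("fast disk timeouts:", sum(t for _, t in fast_disk))
print("slow disk timeouts:", sum(t for _, t in slow_disk))
```

In this model, once the lock hold time exceeds the arrival interval, each new request waits behind all earlier ones, so latency grows with queue position until it crosses the timeout. The model leaves out client retries and any other work done under the lock.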
Issue Links
- duplicates KUDU-1788 Raft UpdateConsensus retry behavior on timeout is counter-productive (Resolved)