[KUDU-1778] Consensus "stuck" after a leader election when both peers were divergent - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 1.1.0
Fix Version/s: 1.2.0
Component/s: consensus
Labels: None
Target Version/s: 1.2.0

Description

On a stress cluster, we saw the following sequence of events after a service restart under load:

1. A peer is elected leader successfully.
2. Both of its followers have divergent logs.
3. When the leader connects to a new peer with a divergent log, it decides to fall back to index 0 rather than to the proper committed index of that peer.
4. Having fallen back to index 0, replication will never succeed, since the first segment of the log was GCed long ago.

Thus, the leader can't replicate to either follower and thinks that it needs to evict both of them, and the tablet gets "stuck".
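
The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Kudu's actual consensus code: `LeaderLog`, `fallback_to_zero`, `fallback_to_committed`, and all index values are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class LeaderLog:
    """Toy leader log; every name and number here is hypothetical."""
    first_retained_index: int  # oldest op still on disk after log GC
    last_index: int            # newest op in the leader's log

    def can_read_from(self, index: int) -> bool:
        # The leader can only send ops that it still has on disk.
        return self.first_retained_index <= index <= self.last_index


def fallback_to_zero(follower_committed_index: int) -> int:
    # Behaviour described in this report: the follower's state is ignored.
    return 0


def fallback_to_committed(follower_committed_index: int) -> int:
    # Behaviour the report calls proper: resume from the follower's
    # committed index.
    return follower_committed_index


# Early log segments were GCed, so nothing below index 5000 is readable.
leader = LeaderLog(first_retained_index=5000, last_index=9000)
follower_committed = 7200

stuck = not leader.can_read_from(fallback_to_zero(follower_committed))
recovers = leader.can_read_from(fallback_to_committed(follower_committed))
print(stuck, recovers)
```

In this model, falling back to index 0 asks for an operation that no longer exists in the leader's log, so the request can never be served, while falling back to the follower's committed index lands inside the retained range.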

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 1

Dates

Created: 01/Dec/16 19:07

Updated: 02/Dec/16 22:34

Resolved: 02/Dec/16 22:34