Kudu / KUDU-1778

Consensus "stuck" after a leader election when both peers were divergent


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: consensus
    • Labels:
      None
      Description

      On a stress cluster, we saw the following sequence of events after a service restart under load:

      • a peer is elected leader successfully
      • both of its followers have divergent logs
      • when the leader connects to a new peer with a divergent log, it falls back to index 0 rather than to the proper committed index of that peer
      • once it has fallen back to index 0, replication can never succeed, because the first segment of the log was already GCed long ago

      As a result, the leader cannot replicate to either follower and concludes that it needs to evict both of them, so the tablet gets "stuck".
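
The failure mode described above can be illustrated with a small model. This is a minimal sketch and not Kudu's actual consensus code: the names `FollowerStatus`, `fallback_index_buggy`, `fallback_index_fixed`, and `can_replicate_from` are hypothetical, and the sketch assumes that log indexes start at 1 and that the leader must still hold every operation after the fallback index in order to resume replication.

```python
from dataclasses import dataclass


@dataclass
class FollowerStatus:
    """Hypothetical summary of what a divergent follower reports to the leader."""
    last_committed_index: int  # highest index the follower knows to be committed
    last_received_index: int   # highest index in its (possibly divergent) log


def fallback_index_buggy(status: FollowerStatus) -> int:
    """Models the reported behaviour: any divergence restarts the peer from index 0."""
    return 0


def fallback_index_fixed(status: FollowerStatus) -> int:
    """Models the expected behaviour: fall back to the follower's committed index."""
    return status.last_committed_index


def can_replicate_from(fallback_index: int, first_retained_index: int) -> bool:
    """Assumption: the leader resumes with the operation right after the fallback
    index, so that operation must not have been garbage-collected yet."""
    return fallback_index + 1 >= first_retained_index


# Example values (made up): the leader has GCed everything below index 4000.
follower = FollowerStatus(last_committed_index=5000, last_received_index=5200)
first_retained = 4000

buggy_ok = can_replicate_from(fallback_index_buggy(follower), first_retained)
fixed_ok = can_replicate_from(fallback_index_fixed(follower), first_retained)
```

With these example values, the fallback to index 0 asks for operations the leader no longer has, while the fallback to the follower's committed index stays inside the retained part of the log.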


            People

            • Assignee: Todd Lipcon (tlipcon)
            • Reporter: Todd Lipcon (tlipcon)