Kudu / KUDU-1778

Consensus "stuck" after a leader election when both peers were divergent

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.1.0
    • Fix Version/s: 1.2.0
    • Component/s: consensus
    • Labels: None

    Description

      On a stress cluster, we saw the following sequence of events after a service restart under load:

      • A peer is successfully elected leader.
      • Both of its followers have divergent logs.
      • When the leader connects to a peer with a divergent log, it falls back to index 0 rather than to that peer's committed index.
      • Replication starting from index 0 can never succeed, because the first segment of the log was garbage-collected long ago.

      As a result, the leader cannot replicate to either follower and concludes that both need to be evicted, so the tablet gets "stuck".
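
      The failure mode described above can be illustrated with a minimal sketch. This is not Kudu's actual code: the function names, the `Optional` handling, and the index values are hypothetical, chosen only to show why rewinding to index 0 cannot work once the head of the leader's log has been garbage-collected, while rewinding to the follower's committed index can.

```python
from typing import Optional


def choose_fallback_index(follower_committed_index: Optional[int]) -> int:
    """Index the leader rewinds to after a follower reports a divergent log.

    Hypothetical helper: rewind to the follower's committed index when it is
    known, and only use 0 when nothing is known about the follower.
    """
    if follower_committed_index is None:
        return 0
    return follower_committed_index


def can_catch_up(fallback_index: int, leader_first_retained_index: int) -> bool:
    """True if the leader still holds every op after fallback_index.

    The leader has to read ops starting at fallback_index + 1; if log GC has
    already removed that op, replication to this follower cannot proceed.
    """
    return fallback_index + 1 >= leader_first_retained_index


# Made-up numbers: the leader's log now starts at op 5000 because earlier
# segments were GCed, and the follower had committed up to op 7200.
FIRST_RETAINED = 5000
FOLLOWER_COMMITTED = 7200

# Behavior described in this issue: always rewinding to 0.
stuck = not can_catch_up(0, FIRST_RETAINED)

# Rewinding to the follower's committed index instead.
recovers = can_catch_up(choose_fallback_index(FOLLOWER_COMMITTED), FIRST_RETAINED)
```

      With these assumed values, `stuck` and `recovers` are both true: the ops just after index 0 are gone, but every op after the follower's committed index is still in the leader's log.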

          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: Todd Lipcon (tlipcon)