Kudu / KUDU-1516

ksck should check for more raft-related status issues

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.10.0
    • Component/s: consensus, ksck, supportability
    • Labels: None

    Description

      We currently have a test cluster where one or more tablets became under-replicated (1 replica remaining out of 3) and were not able to re-replicate in time. Despite this, 'ksck' still reports that the table is healthy and only reports two tablet servers as down. There seems to be a lot of room for improvement:

      • For each tablet, check that at least a majority of its replicas are on live tablet servers and that those tablet servers consider the replica to be in the RUNNING state.
      • Possibly add some basic tablet "health checks", such as asking followers whether they have recently heard successfully from the leader.
      • Perhaps push a canary request to each tablet (e.g. an empty write or a no_op).
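
As a rough illustration of the first check, the majority test could be sketched as below. This is not ksck's actual code or data model: the `Replica` record, the status strings, and the function names are all hypothetical, and it assumes ksck has already collected the set of live tablet servers and each replica's reported state.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Replica:
    # Hypothetical record for illustration; not ksck's real data model.
    ts_uuid: str  # tablet server hosting this replica
    state: str    # state reported by that server, e.g. "RUNNING"


def count_healthy(replicas: Iterable[Replica], live_servers: Set[str]) -> int:
    """Count replicas hosted on a live server that reports them as RUNNING."""
    return sum(
        1 for r in replicas
        if r.ts_uuid in live_servers and r.state == "RUNNING"
    )


def tablet_status(replicas: Iterable[Replica], live_servers: Set[str]) -> str:
    """Classify one tablet from its replica list (status names are made up)."""
    replicas = list(replicas)
    if not replicas:
        return "UNAVAILABLE"
    healthy = count_healthy(replicas, live_servers)
    majority = len(replicas) // 2 + 1
    if healthy == len(replicas):
        return "HEALTHY"
    if healthy >= majority:
        return "UNDER_REPLICATED"
    # Fewer than a majority of healthy replicas: the tablet cannot elect
    # a leader or accept writes, so it should not be reported as healthy.
    return "UNAVAILABLE"


if __name__ == "__main__":
    tablet = [
        Replica("ts-a", "RUNNING"),
        Replica("ts-b", "RUNNING"),
        Replica("ts-c", "RUNNING"),
    ]
    # The scenario from this report: two of three tablet servers are down.
    print(tablet_status(tablet, live_servers={"ts-a"}))
```

With a check like this, the cluster described above (1 live replica out of 3) would be flagged per tablet rather than only showing up as two down tablet servers.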

          People

            Assignee: Todd Lipcon (tlipcon)
            Reporter: Todd Lipcon (tlipcon)