Kudu / KUDU-1516

ksck should check for more Raft-related status issues

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.10.0
    • Component/s: consensus, ksck, supportability
    • Labels:
      None
    • Target Version/s:

      Description

      We currently have a test cluster in which one or more tablets became under-replicated (1 replica remaining out of 3) and could not re-replicate in time. 'ksck' still reports that the table is healthy and only reports two tablet servers as down. There is a lot of room for improvement:

      • For each tablet, check that at least a majority of its replicas are on live tablet servers and that those tablet servers consider the replica to be in the RUNNING state.
      • Add some basic tablet "health checks", such as asking followers whether they have recently heard from the leader successfully.
      • Perhaps push a canary request to each tablet (e.g. an empty write or no_op).
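
      The first check above could be sketched roughly as follows. This is only an illustration of the majority rule, not ksck's actual implementation: `Replica`, `majority_size`, and `tablet_has_healthy_majority` are hypothetical names, and the sketch assumes the replica list contains every replica in the tablet's Raft config, including replicas on tablet servers that are down.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    """Hypothetical view of one replica as seen by a checker tool."""
    ts_uuid: str   # identifier of the hosting tablet server (assumed field)
    ts_live: bool  # True if the checker could reach that tablet server
    state: str     # state the server reports for the replica, e.g. "RUNNING"


def majority_size(num_replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return num_replicas // 2 + 1


def tablet_has_healthy_majority(replicas: List[Replica]) -> bool:
    """True if a majority of replicas are on live servers and RUNNING."""
    if not replicas:
        return False
    healthy = sum(1 for r in replicas if r.ts_live and r.state == "RUNNING")
    return healthy >= majority_size(len(replicas))


# The situation from the description: 1 replica remaining out of 3.
under_replicated = [
    Replica("ts-a", ts_live=True, state="RUNNING"),
    Replica("ts-b", ts_live=False, state="UNKNOWN"),
    Replica("ts-c", ts_live=False, state="UNKNOWN"),
]
print(tablet_has_healthy_majority(under_replicated))
```

      With a check like this, a table would be reported as unhealthy as soon as any of its tablets lost its majority, rather than only listing the tablet servers that are down.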

            People

            • Assignee:
              tlipcon Todd Lipcon
              Reporter:
              tlipcon Todd Lipcon
            • Votes:
              0
              Watchers:
              2
