[KUDU-1516] ksck should check for more raft-related status issues - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: 0.9.1
Fix Version/s: 0.10.0
Component/s: consensus, ksck, supportability
Labels: None
Target Version/s: 1.0.0

Description

We currently have a test cluster in which one or more tablets became under-replicated (1 of 3 replicas remaining) and were not able to re-replicate in time. 'ksck' still reports the table as healthy, though, and only reports two tablet servers as down. There seems to be a lot of room for improvement:

For each tablet, check that at least a majority of its replicas are on live tablet servers, and that those tablet servers consider the replica to be in the RUNNING state.
Perhaps some basic tablet "health checks", such as asking followers whether they have recently heard successfully from the leader?
Perhaps a canary request pushed to each tablet (e.g. an empty write or a no_op)?
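
The first proposed check (a majority of each tablet's replicas on live tablet servers, each reported as RUNNING) could look roughly like the sketch below. This is an illustration only, not ksck's actual implementation: the `Replica` type, the function names, and the input shapes are all hypothetical, and the state strings are assumed to be whatever the tablet server reports.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Replica:
    """Hypothetical record for one replica of a tablet."""
    ts_uuid: str  # tablet server hosting this replica
    state: str    # state reported by that server, e.g. "RUNNING"


def healthy_replica_count(replicas: List[Replica], live_servers: Set[str]) -> int:
    """Count replicas that sit on a live server and are reported as RUNNING."""
    return sum(
        1 for r in replicas
        if r.ts_uuid in live_servers and r.state == "RUNNING"
    )


def has_healthy_majority(replicas: List[Replica], live_servers: Set[str]) -> bool:
    """True if a strict majority of the tablet's replicas are healthy."""
    majority = len(replicas) // 2 + 1
    return healthy_replica_count(replicas, live_servers) >= majority


def unhealthy_tablets(
    tablets: Dict[str, List[Replica]], live_servers: Set[str]
) -> List[str]:
    """Return the sorted IDs of tablets that lack a healthy majority."""
    return sorted(
        tablet_id
        for tablet_id, replicas in tablets.items()
        if not has_healthy_majority(replicas, live_servers)
    )
```

With three replicas per tablet, a tablet with only one replica left on a live server (the situation described above) would be flagged, because one healthy replica is below the majority of two.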

People

Assignee: Todd Lipcon

Reporter: Todd Lipcon

Votes: 0

Watchers: 2

Dates

Created: 06/Jul/16 20:16

Updated: 22/Jul/16 22:05

Resolved: 22/Jul/16 22:05