We currently have a test cluster where one or more tablets have become under-replicated (1 replica remaining out of 3) and weren't able to re-replicate in time. However, 'ksck' still reports the table as healthy and only reports two tablet servers as down. It seems there is a lot of room for improvement:
- for each tablet, check that at least a majority of its replicas are on live tablet servers, and those tablet servers consider the replica to be in RUNNING state
- some basic tablet "health checks", like asking followers whether they have recently heard from the leader successfully?
- perhaps a canary request pushed to each tablet? (e.g. an empty write or no_op)
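
As a rough illustration of the first check (majority of replicas on live tablet servers and in RUNNING state), here is a minimal sketch in Python. It is not ksck code: the `Replica` record, the function names, and the input shapes are all hypothetical, and it assumes the replica list for each tablet is the full configured set, including replicas on servers that are down.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Replica:
    # Hypothetical record: the tablet server hosting the replica and the
    # state that server reports for it (e.g. "RUNNING").
    tserver_id: str
    state: str


def has_healthy_majority(replicas: List[Replica], live_tservers: Set[str]) -> bool:
    """True if a strict majority of the configured replicas are on live
    tablet servers that report the replica as RUNNING."""
    healthy = sum(
        1
        for r in replicas
        if r.tserver_id in live_tservers and r.state == "RUNNING"
    )
    # Strict majority: 2 of 3, 3 of 5, and so on. An empty list never passes.
    return healthy > len(replicas) // 2


def unhealthy_tablets(
    tablets: Dict[str, List[Replica]], live_tservers: Set[str]
) -> List[str]:
    """Return the sorted ids of tablets that lack a healthy majority."""
    return sorted(
        tablet_id
        for tablet_id, replicas in tablets.items()
        if not has_healthy_majority(replicas, live_tservers)
    )
```

With the situation described above (3 replicas, two tablet servers down), a tablet with only one live RUNNING replica would be flagged by `unhealthy_tablets`, instead of the table being reported as healthy.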