Nodes in a cluster sometimes crash because of machine problems such as disk corruption, which can be quite common. However, as long as there are dead tservers, ksck always reports an error (e.g. "Not all Tablet Servers are reachable"), even after all tables have recovered and are healthy.
Currently, the only way to get a healthy ksck status is to restart all masters one by one. In some cases, for example when a machine is completely corrupted, we would like to get a healthy ksck status without restarting, because after the masters restart the cluster takes some time to recover, and during that time scans and upserts to tables are affected. The recovery time can be long and depends mainly on the scale of the cluster. This problem is serious and annoying, especially when tservers crash frequently in a large cluster.
It would be valuable to have an easier way to delete dead tservers from the master. I will add a kudu command to support this.