After restarting a cluster that holds a good amount of data, it's hard to tell when it's "done". Right now I do the following:
- Run ksck and wait until most tablets are no longer in the "unavailable" or "bootstrapping" state.
- Watch the metrics and see when the data under management is close to where it was before restarting (it grows as tablets are getting bootstrapped).
- Look at the tablet server web UIs and compare how many tablets are done bootstrapping vs. still bootstrapping vs. not started.
Ideas on how to improve this:
- In the master's web UI for tablet servers, show how many tablets are running vs. not running (I wouldn't add anything about tombstoned tablets).
- Add metrics for tablets in different states.
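
To make the per-state idea concrete, here is a rough sketch of the kind of summary such metrics could give. It is not existing Kudu code: the state names ("RUNNING", "BOOTSTRAPPING", "NOT_STARTED") and the function summarize_tablet_states are made up for illustration, and the input is assumed to be one state string per tablet, however that list ends up being collected.

```python
from collections import Counter

# Hypothetical state names, used for illustration only; the real set of
# states depends on what the tablet server reports.
DONE_STATES = {"RUNNING"}


def summarize_tablet_states(states):
    """Count tablets per state and return the fraction that are running."""
    counts = Counter(states)
    total = sum(counts.values())
    running = sum(counts[s] for s in DONE_STATES)
    # With no tablets at all, report 0.0 rather than dividing by zero.
    fraction = running / total if total else 0.0
    return counts, fraction


# Example input: one state string per tablet on a server.
states = ["RUNNING", "RUNNING", "BOOTSTRAPPING", "NOT_STARTED"]
counts, fraction = summarize_tablet_states(states)
print(dict(counts), fraction)
```

With counts like these exposed per tablet server, "done" could be read as the running fraction approaching 1, instead of eyeballing ksck output or the web UI.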