To improve visibility into repair, we should expose its internal state via virtual tables; the state should include coordinator state as well as participant state (validation, sync, etc.).
I propose the following tables:
repairs - high-level summary of the global state of repair; this should be queried on the coordinator.
repair_tasks - represents RepairJob and participant state. This will show whether validations are running on participants and the progress they are making; this should be queried on the coordinator.
repair_validations - shows the state of the validation task and is updated periodically while validation is running; this should be queried on the participants.
The main reason for exposing virtual tables rather than durable tables is to make sure that what is exposed is accurate. With write failures or node failures, durable tables could become inaccurate and introduce edge cases where the repair is not running but the tables say it is; by relying on repair's internal in-memory bookkeeping, these problems go away.
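To illustrate the accuracy argument, here is a minimal sketch of the failure mode, not the proposed implementation. It is a language-agnostic toy model written in Python; `Coordinator`, `RepairState`, `durable_rows`, `virtual_table_rows`, and the row columns are all hypothetical names made up for this example and do not correspond to Cassandra classes or to the final table schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILURE = "failure"


@dataclass
class RepairState:
    """In-memory bookkeeping for one repair (hypothetical, not a Cassandra class)."""
    repair_id: str
    state: State = State.RUNNING


@dataclass
class Coordinator:
    # Live bookkeeping that the repair code itself mutates.
    active: dict = field(default_factory=dict)
    # Stand-in for a durable table that has to be written separately.
    durable_rows: dict = field(default_factory=dict)

    def start(self, repair_id):
        self.active[repair_id] = RepairState(repair_id)
        self.durable_rows[repair_id] = State.RUNNING.value

    def fail(self, repair_id, durable_write_ok=True):
        # The in-memory state always changes; the durable write may be lost.
        self.active[repair_id].state = State.FAILURE
        if durable_write_ok:
            self.durable_rows[repair_id] = State.FAILURE.value

    def virtual_table_rows(self):
        # Rows are computed from live state at read time; nothing is stored.
        return [
            {"id": r.repair_id, "state": r.state.value}
            for r in self.active.values()
        ]


coordinator = Coordinator()
coordinator.start("r1")
coordinator.fail("r1", durable_write_ok=False)  # simulate a lost write

print(coordinator.durable_rows["r1"])    # stale: still says "running"
print(coordinator.virtual_table_rows())  # reflects the failure
```

In this model the durable copy keeps reporting the repair as running after the lost write, while the rows derived from in-memory state report the failure, which is the discrepancy the virtual tables are meant to avoid.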
This jira does not try to solve the following:
1) repair resiliency - there are edge cases where repair hits an error and runs forever (at least from nodetool's perspective).
2) repair stream tracking - I have not learned the streaming side yet, and from what I can see multiple implementations exist, so the scope seems large. My hope is to punt this from this jira and tackle it separately.
|Refactor repair coordinator so errors are consistent||Resolved||