Details
- Type: Improvement
- Status: Open
- Priority: Low
- Resolution: Unresolved
Description
Currently, if one node fails any phase of the repair (validation, streaming), the repair session is aborted, but the other nodes are not notified and keep doing either validation or syncing with other nodes.
With CASSANDRA-10070 automatically scheduling repairs and potentially scheduling retries, it would be nice to make sure all nodes abort failed repairs in order to be able to start other repairs safely on the same nodes.
From CASSANDRA-10070:
As far as I understood, if there are nodes A, B, C running repair, A is the coordinator. If validation or streaming fails on node B, the coordinator (A) is notified and fails the repair session, but node C will keep doing validation and/or streaming, which could cause problems (or increased load) if we start another repair session on the same range.
We will probably need to extend the repair protocol to perform this cleanup/abort step on failure. We already have a legacy cleanup message that doesn't seem to be used in the current protocol, which we could maybe reuse to clean up repair state after a failure. This repair abort will probably overlap with CASSANDRA-3486. In any case, this is a separate (but related) issue and we should address it in an independent ticket, and make this ticket dependent on that.
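To make the participant side of such a cleanup/abort step concrete, here is a minimal sketch, not Cassandra code. It assumes each node tracks its in-flight validation and streaming tasks per repair session as `Future`s; the class `RepairTaskRegistry` and its methods are hypothetical names introduced only for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Future;

// Hypothetical sketch: this class does not exist in Cassandra.
final class RepairTaskRegistry {
    // In-flight validation/streaming tasks, keyed by repair session id.
    private final Map<String, List<Future<?>>> tasksBySession = new ConcurrentHashMap<>();

    /** Records a task started on this node on behalf of the given session. */
    void register(String sessionId, Future<?> task) {
        tasksBySession.computeIfAbsent(sessionId, id -> new CopyOnWriteArrayList<>()).add(task);
    }

    /**
     * Handles an abort/cleanup request for a session: forgets the session and
     * cancels its tasks. Returns how many tasks were actually cancelled.
     * A task registered concurrently with the abort could be missed; a real
     * implementation would need to close that race.
     */
    int abort(String sessionId) {
        List<Future<?>> tasks = tasksBySession.remove(sessionId);
        if (tasks == null) {
            return 0; // unknown or already cleaned-up session
        }
        int cancelled = 0;
        for (Future<?> task : tasks) {
            if (task.cancel(true)) {
                cancelled++;
            }
        }
        return cancelled;
    }

    /** True while the node still holds state for the session. */
    boolean hasSession(String sessionId) {
        return tasksBySession.containsKey(sessionId);
    }
}
```

Making `abort` idempotent (a second call returns 0) would let the coordinator resend the message safely if it is unsure whether the first one arrived.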
On CASSANDRA-5426 slebresne suggested doing this to avoid unexpected conditions/hangs:
I wonder if maybe we should have more of a fail-fast policy when there are errors. For instance, if one node fails its validation phase, it might be worth failing right away and letting the user re-trigger a repair once they have fixed whatever was the source of the error, rather than still differencing/syncing the other nodes.
Going a bit further, I think we should add 2 messages to interrupt the validation and sync phase. If only because that could be useful to users if they need to stop a repair for some reason, but also, if we get an error during validation from one node, we could use that to interrupt the other nodes and thus fail fast while minimizing the amount of work done uselessly.
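The fail-fast idea above can be sketched from the coordinator's point of view. This is an illustrative sketch under stated assumptions, not the actual repair protocol: `FailFastCoordinator`, `AbortSender`, and `sendAbort` are hypothetical names, and the "message" is just a callback standing in for whatever interrupt message the protocol would define.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names are Cassandra classes.
final class FailFastCoordinator {
    /** Hypothetical transport hook; a real implementation would send a repair message. */
    interface AbortSender {
        void sendAbort(String node, String sessionId, String reason);
    }

    private final String sessionId;
    private final Set<String> participants;
    private final AbortSender sender;
    private boolean failed = false;

    FailFastCoordinator(String sessionId, Set<String> participants, AbortSender sender) {
        this.sessionId = sessionId;
        this.participants = new LinkedHashSet<>(participants);
        this.sender = sender;
    }

    /** Called when a participant reports a validation or streaming failure. */
    synchronized void onParticipantFailure(String failedNode, String reason) {
        if (failed) {
            return; // abort already propagated; ignore later failures
        }
        failed = true;
        // Interrupt every other participant so none keeps validating or syncing.
        for (String node : participants) {
            if (!node.equals(failedNode)) {
                sender.sendAbort(node, sessionId, reason);
            }
        }
    }

    synchronized boolean isFailed() {
        return failed;
    }

    public static void main(String[] args) {
        List<String> aborted = new ArrayList<>();
        Set<String> nodes = new LinkedHashSet<>(Arrays.asList("B", "C", "D"));
        FailFastCoordinator coordinator =
            new FailFastCoordinator("session-1", nodes, (node, id, reason) -> aborted.add(node));
        coordinator.onParticipantFailure("B", "validation failed");
        System.out.println(aborted); // prints [C, D]
    }
}
```

The same abort path could also serve a user-initiated stop, which is the second use case mentioned in the comment.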
Issue Links
- is related to
  - CASSANDRA-3486 Node Tool command to stop repair (Open)
  - CASSANDRA-5426 Redesign repair messages (Resolved)
- is required by
  - CASSANDRA-11264 Repair scheduling - Failure handling and retry (Open)
  - CASSANDRA-11263 Repair scheduling - Polling and monitoring module (Awaiting Feedback)
  - CASSANDRA-10070 Automatic repair scheduling (Open)