Attaching a preliminary patch in case anyone wants to have a look or give feedback before the review-ready version:
- Add nodetool repair --list to list ongoing repair jobs (parent repair sessions) in the local node
- Add nodetool repair --abort <jobId> and nodetool repair --abort-all to abort a specific or all jobs
- Any participant can abort the repair job:
  - When a participant receives an abort request, it sends an abort message to the coordinator and aborts its local tasks
  - When a coordinator receives an abort message or abort request, it sends an abort message to all participants and aborts its local tasks, failing the repair job
- Add abort support to StreamResultFuture and StreamSession
- Refactor ActiveRepairService and RepairMessageVerbHandler
- Add dtests that abort repair on the coordinator and on participants in different phases (validation, sync, anticompaction)
- Fix races and leaks found during dtests
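The abort propagation rule described in the list above can be sketched roughly as follows. This is a minimal in-memory model, not the patch itself: `Node`, `abort(boolean)` and the shared log are hypothetical names, messages are modelled as direct method calls, and the order "abort locally, then notify" is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

final class RepairAbortSketch {
    /** Hypothetical stand-in for one node taking part in a single repair job. */
    static final class Node {
        final String name;
        final List<String> log;                            // shared event log, for illustration only
        final List<Node> participants = new ArrayList<>(); // filled in on the coordinator only
        Node coordinator;                                  // null when this node is the coordinator
        boolean aborted;

        Node(String name, List<String> log) {
            this.name = name;
            this.log = log;
        }

        /**
         * Handles an abort. operatorRequest is true for a local abort request
         * and false for an abort message received from another node.
         */
        void abort(boolean operatorRequest) {
            if (aborted)
                return; // already handled, so the coordinator's echo is a no-op
            aborted = true;
            log.add(name + ": aborted local tasks");
            if (coordinator == null) {
                // Coordinator: notify every participant, then fail the job.
                for (Node participant : participants)
                    participant.abort(false);
                log.add(name + ": failed repair job");
            } else if (operatorRequest) {
                // Participant that received the request: tell the coordinator.
                coordinator.abort(false);
            }
        }
    }

    /** Builds a coordinator C with participants A and B, then aborts from A. */
    static List<String> abortFromParticipant() {
        List<String> log = new ArrayList<>();
        Node c = new Node("C", log);
        Node a = new Node("A", log);
        Node b = new Node("B", log);
        a.coordinator = c;
        b.coordinator = c;
        c.participants.add(a);
        c.participants.add(b);
        a.abort(true);
        return log;
    }

    public static void main(String[] args) {
        abortFromParticipant().forEach(System.out::println);
    }
}
```

The `aborted` flag keeps the handler idempotent, so the message the coordinator sends back to the originating participant does not loop.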
Limitations and next steps
While compactions have abort/stop support via CompactionManager.stopCompactionById, we cannot guarantee that a compaction will be aborted when its repair is aborted, because its abort handler (Holder) is only registered during iteration via the CompactionIterator. If we stop the compaction before that point, the task is not aborted and will execute even if its parent repair session was aborted. Furthermore, an anti-compaction is split into multiple subcompactions, so this method only stops the currently running subcompaction.
In order to overcome this, I aborted the compaction task Future directly, which causes the task thread to be interrupted. I then check Thread.currentThread().isInterrupted() during iteration and throw a CompactionInterruptedException if it returns true, causing the compaction to be aborted (by brute force).
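The brute-force approach can be illustrated with plain JDK classes. This is only a sketch of the technique: `InterruptAwareIterator` and `TaskInterruptedException` are hypothetical stand-ins for the CompactionIterator check and CompactionInterruptedException, and the endless stream stands in for the rows being compacted.

```java
import java.util.Iterator;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

final class InterruptCheckSketch {
    /** Hypothetical stand-in for CompactionInterruptedException. */
    static final class TaskInterruptedException extends RuntimeException {
        TaskInterruptedException(String message) { super(message); }
    }

    /** Checks the current thread's interrupt flag before handing out each element. */
    static final class InterruptAwareIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;

        InterruptAwareIterator(Iterator<T> delegate) { this.delegate = delegate; }

        @Override public boolean hasNext() { return delegate.hasNext(); }

        @Override public T next() {
            if (Thread.currentThread().isInterrupted())
                throw new TaskInterruptedException("task thread was interrupted");
            return delegate.next();
        }
    }

    /** Starts an endless iteration, cancels its Future, and reports whether the check fired. */
    static boolean cancelAndObserve() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch finished = new CountDownLatch(1);
        AtomicBoolean abortedByCheck = new AtomicBoolean(false);
        Future<?> task = executor.submit(() -> {
            Iterator<Integer> rows =
                new InterruptAwareIterator<>(Stream.iterate(0, i -> i + 1).iterator());
            started.countDown();
            try {
                while (rows.hasNext())
                    rows.next();
            } catch (TaskInterruptedException e) {
                abortedByCheck.set(true);
            } finally {
                finished.countDown();
            }
        });
        try {
            started.await();   // make sure the task is running before cancelling
            task.cancel(true); // true = interrupt the thread running the task
            return finished.await(5, TimeUnit.SECONDS) && abortedByCheck.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("aborted by interrupt check: " + cancelAndObserve());
    }
}
```

The sketch waits for the task to start before cancelling; a `FutureTask` cancelled before it starts simply never runs.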
However, this is not very safe, because it can generate a ClosedByInterruptException if we're blocked on an I/O operation, and we currently treat any IOException as a corrupt sstable. Furthermore, an interrupted thread is not able to abort the transaction when getting a CompactionInterruptedException. To solve this we could special-case interruptions in many places (readers, transaction aborting, etc.), but even this wouldn't guarantee we're safe, so this is probably a bad smell.
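The ClosedByInterruptException hazard is easy to reproduce with a plain `FileChannel`, because ClosedByInterruptException is a subclass of IOException. The sketch below is illustrative only: `naiveRead`, `interruptAwareRead` and the `Outcome` enum are hypothetical, and the temp file stands in for an sstable component. It sets the interrupt flag before the read, which per the `InterruptibleChannel` contract closes the channel and throws immediately.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InterruptedReadSketch {
    enum Outcome { OK, MARKED_CORRUPT, ABORTED }

    /** Naive handler: every IOException is taken to mean a corrupt file. */
    static Outcome naiveRead(FileChannel channel) {
        try {
            channel.read(ByteBuffer.allocate(4));
            return Outcome.OK;
        } catch (IOException e) {
            return Outcome.MARKED_CORRUPT;
        }
    }

    /** Special-cased handler: an interrupt-induced close is reported as an abort. */
    static Outcome interruptAwareRead(FileChannel channel) {
        try {
            channel.read(ByteBuffer.allocate(4));
            return Outcome.OK;
        } catch (ClosedByInterruptException e) {
            return Outcome.ABORTED;
        } catch (IOException e) {
            return Outcome.MARKED_CORRUPT;
        }
    }

    /** Reads a small, valid temp file with the interrupt flag already set. */
    static Outcome readWithInterruptSet(boolean interruptAware) {
        try {
            Path file = Files.createTempFile("interrupt-sketch", ".db");
            file.toFile().deleteOnExit();
            Files.write(file, new byte[] {1, 2, 3, 4});
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                // Simulates Future.cancel(true) reaching this thread around an I/O call.
                Thread.currentThread().interrupt();
                return interruptAware ? interruptAwareRead(channel) : naiveRead(channel);
            } finally {
                Thread.interrupted(); // clear the flag so the caller is unaffected
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("naive handler:           " + readWithInterruptSet(false));
        System.out.println("interrupt-aware handler: " + readWithInterruptSet(true));
    }
}
```

The file is intact in both runs, yet the naive handler classifies it as corrupt. The second handler shows the kind of special-casing that would have to be repeated at every read site.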
A cleaner option, which I will implement in the next iteration, is to associate a CompactionHolder with a ListenableFuture as soon as the anti-compaction or validation is submitted, so we can abort it safely without interrupting the compaction thread.
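One possible shape for that cooperative approach, sketched with JDK classes only: `AbortHolder`, `TaskAbortedException` and `submit` are hypothetical names, and `CompletableFuture` is used here in place of Guava's ListenableFuture purely to keep the example self-contained. The holder exists before the task starts, so a stop request that arrives while the task is still queued is honoured on the first iteration, and no thread is interrupted.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

final class CooperativeAbortSketch {
    /** Hypothetical holder created at submission time, before the task runs. */
    static final class AbortHolder {
        private final AtomicBoolean stopRequested = new AtomicBoolean(false);

        void stop() { stopRequested.set(true); }

        boolean isStopRequested() { return stopRequested.get(); }
    }

    /** Hypothetical exception raised by the task itself when it sees the stop flag. */
    static final class TaskAbortedException extends RuntimeException {
        TaskAbortedException(String message) { super(message); }
    }

    /** Submits a task that polls the holder between rows instead of relying on interrupts. */
    static CompletableFuture<Integer> submit(ExecutorService executor, AbortHolder holder, int rows) {
        return CompletableFuture.supplyAsync(() -> {
            int processed = 0;
            for (int i = 0; i < rows; i++) {
                if (holder.isStopRequested())
                    throw new TaskAbortedException("aborted after " + processed + " rows");
                processed++;
            }
            return processed;
        }, executor);
    }

    /** Requests a stop while the task is still queued, then lets it run. */
    static String abortBeforeStart() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            CountDownLatch gate = new CountDownLatch(1);
            // Keeps the only worker busy so the next task stays queued.
            executor.submit(() -> { gate.await(); return null; });
            AbortHolder holder = new AbortHolder();
            CompletableFuture<Integer> future = submit(executor, holder, 1000);
            holder.stop();    // abort arrives before the task has started
            gate.countDown(); // now let the queued task run
            try {
                return "completed " + future.join() + " rows";
            } catch (CompletionException e) {
                return e.getCause().getMessage();
            }
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(abortBeforeStart());
    }
}
```

Because the task raises the exception from its own thread at a point it chooses, it remains free to run its normal cleanup (for example, aborting a transaction) before the future completes exceptionally.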