Details
Bug
Status: Resolved
Normal
Resolution: Fixed
None
Normal
Description
I'm currently looking into an issue with our repair process where we see a significant delay at the end of the repair task, before nodetool actually terminates. At the same time, JMX NOTIF_LOST errors are reported in nodetool during most repair runs.
Currently StorageService.repairAsync(keyspace, options) is called through JMX, which starts a new thread executing RepairRunnable with the provided options. StorageService itself extends NotificationBroadcasterSupport and sends the JMX progress notifications emitted from RepairRunnable (or during bootstrap). If you take a closer look at RepairRunnable, JMXProgressSupport and StorageService/NotificationBroadcasterSupport.sendNotification, you'll notice that this all happens within the calling thread, i.e. the thread running RepairRunnable. Given the lost notifications and all kinds of potential networking-related issues, I'm not really comfortable having the repair coordinator thread running in the JMX stack. Fortunately, NotificationBroadcasterSupport accepts a custom executor as a constructor argument. See the attached patch.
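To illustrate the idea, here is a minimal standalone sketch of handing a custom executor to NotificationBroadcasterSupport so that listeners are invoked on a separate thread rather than on the thread calling sendNotification. It is not the attached patch: ProgressBroadcaster, the "notification-sender" thread name, and the notification type and message are hypothetical names chosen for this example only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import javax.management.Notification;
import javax.management.NotificationBroadcasterSupport;

class AsyncBroadcasterSketch {

    // Hypothetical stand-in for a broadcaster such as StorageService.
    // Passing an Executor to the superclass makes sendNotification()
    // dispatch each listener call through that executor.
    static class ProgressBroadcaster extends NotificationBroadcasterSupport {
        ProgressBroadcaster(Executor executor) {
            super(executor);
        }
    }

    // Sends one notification and returns the name of the thread on which
    // the listener was invoked (null if delivery did not happen in time).
    static String deliveryThreadName() {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "notification-sender"); // assumed name
            t.setDaemon(true);
            return t;
        });
        try {
            ProgressBroadcaster broadcaster = new ProgressBroadcaster(executor);
            AtomicReference<String> seenOn = new AtomicReference<>();
            CountDownLatch delivered = new CountDownLatch(1);

            broadcaster.addNotificationListener((notification, handback) -> {
                seenOn.set(Thread.currentThread().getName());
                delivered.countDown();
            }, null, null);

            // The calling thread only enqueues the delivery and returns.
            broadcaster.sendNotification(
                new Notification("progress", "repair-sketch", 1L, "example message"));

            delivered.await(5, TimeUnit.SECONDS);
            return seenOn.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("caller thread:   " + Thread.currentThread().getName());
        System.out.println("listener thread: " + deliveryThreadName());
    }
}
```

With the no-argument constructor, the listener would instead run on the thread that called sendNotification, which is the situation described above for RepairRunnable.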