Details
- Improvement
- Status: Open
- Normal
- Resolution: Unresolved
Description
During repair, the coordinator and replicas exchange various messages. I have seen cases where those messages are lost.
We have made repair messages more durable (CASSANDRA-5393, etc.), but messages still appear to get lost, and the repair then hangs until the messaging timeout is reached.
We can prevent this by tracking repair status on each repair participant and periodically checking that state to make sure everything is progressing.
We could also add a command / JMX API to query repair state.
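To make the proposal concrete, here is a minimal sketch of participant-side state tracking. Everything in it is an assumption for illustration: the class `RepairStateTracker`, the `State` values, and the stall timeout are hypothetical and do not correspond to existing Cassandra classes or APIs. It only shows the idea of recording the last known state per repair session and finding sessions that have made no progress within a given period.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical per-node tracker of repair session state (not a Cassandra API). */
public class RepairStateTracker {
    /** Illustrative states only; the real set of repair phases may differ. */
    public enum State { PREPARING, VALIDATING, SYNCING, COMPLETED, FAILED }

    /** Last known state of a session and when it was recorded. */
    private static final class Entry {
        final State state;
        final long updatedAtMillis;

        Entry(State state, long updatedAtMillis) {
            this.state = state;
            this.updatedAtMillis = updatedAtMillis;
        }
    }

    private final Map<UUID, Entry> sessions = new ConcurrentHashMap<>();
    private final long stallTimeoutMillis;

    public RepairStateTracker(long stallTimeoutMillis) {
        this.stallTimeoutMillis = stallTimeoutMillis;
    }

    /** Record a state change, e.g. when a repair message is sent or received. */
    public void update(UUID sessionId, State state, long nowMillis) {
        sessions.put(sessionId, new Entry(state, nowMillis));
    }

    /** Return the last known state, or null if the session is unknown. */
    public State query(UUID sessionId) {
        Entry entry = sessions.get(sessionId);
        return entry == null ? null : entry.state;
    }

    /** Return sessions that are not finished and have not changed state within the timeout. */
    public List<UUID> findStalled(long nowMillis) {
        List<UUID> stalled = new ArrayList<>();
        for (Map.Entry<UUID, Entry> e : sessions.entrySet()) {
            Entry entry = e.getValue();
            boolean finished = entry.state == State.COMPLETED || entry.state == State.FAILED;
            if (!finished && nowMillis - entry.updatedAtMillis > stallTimeoutMillis) {
                stalled.add(e.getKey());
            }
        }
        return stalled;
    }
}
```

Under these assumptions, a node could call `findStalled(System.currentTimeMillis())` from a periodic task (for example one scheduled with `ScheduledExecutorService.scheduleAtFixedRate`) and ask the other participants about the stalled sessions instead of waiting for the messaging timeout. The `query` method is the kind of lookup that a command or JMX operation could expose.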
Issue Links
- relates to CASSANDRA-12860 Nodetool repair fragile: cannot properly recover from single node failure. Has to restart all nodes in order to repair again (Resolved)