[CASSANDRA-13480] nodetool repair can hang forever if we lose the notification for the repair completing/failing - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Low
Resolution: Fixed
Fix Version/s: 4.0-alpha1, 4.0
Component/s: Tool/nodetool
Labels: repair
Severity: Low

Description

When a JMX lost notification occurs, the notification that was lost is sometimes the one that lets RepairRunner know the repair has finished (ProgressEventType.COMPLETE, or ERROR for that matter).
The nodetool process running the repair then hangs forever.

I have a test which reproduces the issue here:
https://github.com/Jollyplum/cassandra-dtest/tree/repair_hang_test

To fix this, on receiving a notification that notifications have been lost (JMXConnectionNotification.NOTIFS_LOST), we query a new endpoint via JMX to retrieve all the notifications we're interested in. That lets us replay the ones we missed and avoid this scenario.

It's also possible that the JMXConnectionNotification.NOTIFS_LOST notification is itself lost, so for good measure I have made RepairRunner poll periodically for any notifications that were sent but never received (scoped to the tag of the given repair).
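The replay-and-poll idea above can be sketched roughly as follows. This is an illustration, not the code from the patch: the class `RepairReplaySketch`, the `RepairEvent` type, and the `fetchEvents` function (standing in for the new JMX endpoint, whose real name is not given here) are all hypothetical. Only `JMXConnectionNotification.NOTIFS_LOST` and the `Notification` type come from the standard JMX API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import javax.management.Notification;
import javax.management.remote.JMXConnectionNotification;

final class RepairReplaySketch {
    /** Hypothetical record of one progress event retained on the server. */
    static final class RepairEvent {
        final long id;
        final String tag;
        final String type; // e.g. "START", "COMPLETE", "ERROR"

        RepairEvent(long id, String tag, String type) {
            this.id = id;
            this.tag = tag;
            this.type = type;
        }
    }

    private final String tag;
    // Hypothetical stand-in for the new JMX endpoint: retained events for a tag.
    private final Function<String, List<RepairEvent>> fetchEvents;
    private final Set<Long> seen = new HashSet<>();
    private boolean finished;

    RepairReplaySketch(String tag, Function<String, List<RepairEvent>> fetchEvents) {
        this.tag = tag;
        this.fetchEvents = fetchEvents;
    }

    /** Normal path: a progress event arrives as a live notification. */
    synchronized void onEvent(RepairEvent e) {
        if (!tag.equals(e.tag) || !seen.add(e.id)) {
            return; // belongs to another repair, or already handled
        }
        if ("COMPLETE".equals(e.type) || "ERROR".equals(e.type)) {
            finished = true;
        }
    }

    /** Connection-level notifications: replay when the connector reports a loss. */
    void onConnectionNotification(Notification n) {
        if (JMXConnectionNotification.NOTIFS_LOST.equals(n.getType())) {
            replayMissed();
        }
    }

    /** Also meant to be called on a timer, in case NOTIFS_LOST was itself lost. */
    void replayMissed() {
        for (RepairEvent e : fetchEvents.apply(tag)) {
            onEvent(e);
        }
    }

    synchronized boolean isFinished() {
        return finished;
    }

    public static void main(String[] args) {
        List<RepairEvent> serverLog = new ArrayList<>();
        RepairReplaySketch runner =
            new RepairReplaySketch("repair:1", t -> new ArrayList<>(serverLog));

        RepairEvent start = new RepairEvent(1, "repair:1", "START");
        serverLog.add(start);
        runner.onEvent(start); // delivered normally

        serverLog.add(new RepairEvent(2, "repair:1", "COMPLETE")); // never delivered
        System.out.println("finished before replay: " + runner.isFinished());

        runner.onConnectionNotification(new JMXConnectionNotification(
            JMXConnectionNotification.NOTIFS_LOST, "connector", "conn-1", 1L,
            "notifications lost", Long.valueOf(1)));
        System.out.println("finished after replay: " + runner.isFinished());
    }
}
```

Deduplicating on an event id makes the replay idempotent, so the periodic poll and the NOTIFS_LOST handler can both call `replayMissed()` without double-processing events that were already delivered.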

Users who don't use nodetool but go via JMX directly can still use this new endpoint and implement similar behaviour in their clients as desired.
I'm also expiring the notifications that are kept on the server side.
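One way to picture server-side retention with expiry is the sketch below. It is an assumption-laden illustration rather than the patch's implementation: the class name, the `ttlMillis` parameter, and the method names are all hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;

final class RetainedNotificationsSketch {
    /** One retained notification: repair tag, event type, and when it was recorded. */
    private static final class Entry {
        final String tag;
        final String type;
        final long timestampMillis;

        Entry(String tag, String type, long timestampMillis) {
            this.tag = tag;
            this.type = type;
            this.timestampMillis = timestampMillis;
        }
    }

    private final long ttlMillis; // hypothetical retention period
    private final List<Entry> entries = new ArrayList<>();

    RetainedNotificationsSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Retain a notification so clients that missed it can ask for it later. */
    synchronized void record(String tag, String type, long nowMillis) {
        expire(nowMillis);
        entries.add(new Entry(tag, type, nowMillis));
    }

    /** Event types still retained for one repair tag, oldest first. */
    synchronized List<String> typesFor(String tag, long nowMillis) {
        expire(nowMillis);
        List<String> out = new ArrayList<>();
        for (Entry e : entries) {
            if (e.tag.equals(tag)) {
                out.add(e.type);
            }
        }
        return out;
    }

    /** Drop entries that have been held for at least the retention period. */
    private void expire(long nowMillis) {
        entries.removeIf(e -> nowMillis - e.timestampMillis >= ttlMillis);
    }

    public static void main(String[] args) {
        RetainedNotificationsSketch store = new RetainedNotificationsSketch(1000);
        store.record("repair:1", "COMPLETE", 0);
        System.out.println(store.typesFor("repair:1", 500));  // still retained
        System.out.println(store.typesFor("repair:1", 1000)); // expired by now
    }
}
```

Expiring on every read and write keeps the sketch free of background threads; a real server might prefer a scheduled cleanup instead.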
Please let me know if you have any questions or can think of a different approach. I also tried setting:
JVM_OPTS="$JVM_OPTS -Djmx.remote.x.notification.buffer.size=5000"
but this didn't fix the test. It might help in some scenarios, but this test doesn't even send that many notifications, so I'm not surprised it made no difference.
As far as I can tell, lost notifications are always a potential problem with JMX.

Issue Links

is related to

CASSANDRA-14453 Improve visibility into repair state

Open

CASSANDRA-8076 Expose an mbean method to poll for repair job status

Resolved

supercedes

CASSANDRA-8076 Expose an mbean method to poll for repair job status

Resolved

links to

GitHub Pull Request #122

People

Assignee: Matt Byrd

Reporter: Matt Byrd

Authors: Matt Byrd

Reviewers: Chris Lohfink

Votes: 0

Watchers: 11

Dates

Created: 28/Apr/17 01:19

Updated: 07/Mar/23 11:52

Resolved: 29/Jun/17 19:17