[CASSANDRA-13308] Gossip breaks, Hint files not being deleted on nodetool decommission - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 3.0.14, 3.11.0, 4.0-alpha1, 4.0
Component/s: Consistency/Hints, Legacy/Streaming and Messaging
Labels: None
Environment: Using Cassandra version 3.0.9
Bug Category: Availability - Unavailable
Severity: Normal

Description

How to reproduce the issue I'm seeing:
Shut down Cassandra on one node of the cluster and wait until the other nodes have accumulated a large number of hints for it. Start Cassandra on that node and immediately run "nodetool decommission" on it.

The node streams its replicas and marks itself as DECOMMISSIONED, but the other nodes do not seem to see this message. "nodetool status" shows the decommissioned node in state "UL" on all other nodes (it is also still present in system.peers), and the Cassandra logs show that gossip tasks on those nodes are not proceeding (the number of pending tasks keeps increasing). Jstack suggests that a gossip task is blocked on hints dispatch (I can provide traces if this is not obvious). Because the cluster is large and there are many hints, the dispatch takes a long time.

On inspecting "/var/lib/cassandra/hints" on the other nodes, I see many hint files for the decommissioned node. The documentation seems to suggest that these hints should be deleted during "nodetool decommission", but that does not seem to be the case here. This is the bug being reported.
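For anyone who wants to check the same thing on their own nodes, here is a minimal Python sketch that counts the hint files per target host in a hints directory. It assumes hint files are named `<host-id>-<timestamp>-<version>.hints`, with the host ID in UUID form; that naming is an assumption about the 3.0 on-disk layout and is not confirmed in this report. The function name `summarize_hints` is hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed file-name layout: <host-id UUID>-<timestamp>-<version>.hints
HINT_FILE = re.compile(
    r"^(?P<host_id>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
    r"-(?P<timestamp>\d+)-(?P<version>\d+)\.hints$"
)


def summarize_hints(hints_dir):
    """Return {host_id: (file_count, total_bytes)} for hint files in hints_dir."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(hints_dir).iterdir():
        match = HINT_FILE.match(path.name)
        # Skip anything that is not a regular file with the assumed name layout.
        if match is None or not path.is_file():
            continue
        entry = totals[match.group("host_id").lower()]
        entry[0] += 1
        entry[1] += path.stat().st_size
    return {host: (count, size) for host, (count, size) in totals.items()}
```

Calling `summarize_hints("/var/lib/cassandra/hints")` on a live node and looking up the decommissioned node's host ID in the result should show whether hint files for it are still present. The script only reads the directory and does not delete anything.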

When I try to recover from this scenario by manually deleting the hint files on the nodes, the hints dispatcher threads throw a number of exceptions and the decommissioned node moves to state "DL" (perhaps it missed some gossip messages?). The node is still in my "system.peers" table.

Restarting Cassandra on all nodes after this step does not fix the issue (the node remains in the peers table). In fact, after the restart the decommissioned node is in state "DN".

Attachments

28207.stack, 09/Mar/17 02:34, 220 kB, Arijit Banerjee
logs, 09/Mar/17 02:34, 2 kB, Arijit Banerjee
logs_decommissioned_node, 09/Mar/17 02:39, 3 kB, Arijit Banerjee

Issue Links

is duplicated by: CASSANDRA-13562 Cassandra removenode makes Gossiper Thread hang forever (Resolved)

People

Assignee: Jeff Jirsa
Reporter: Arijit Banerjee
Authors: Jeff Jirsa
Reviewers: Aleksey Yeschenko
Votes: 0
Watchers: 8

Dates

Created: 08/Mar/17 10:09
Updated: 15/May/20 07:59
Resolved: 19/Apr/17 16:02