Cassandra / CASSANDRA-9218

Node thinks other nodes are down after heavy GC


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Duplicate
    • Priority: Normal

    Description

      I have a few troublesome nodes which often end up doing very long GC pauses. The root cause of this is yet to be found, but it's causing another problem: the affected node(s) mark other nodes as down, and they never recover.

      Here's how it goes:

      1. Node goes into troublesome mode, doing heavy GC with long (10+ seconds) GC pauses.
      2. While this happens, node will mark other nodes as down.
      3. Once the overload situation resolves, the node still thinks the other nodes are down (they are not). It's also quite common that other nodes think the affected node is down.

      So we often end up with node A thinking some 30 nodes are down, and then a bunch of other nodes believing node A is down. This is in a cluster of 56 nodes.

      The only way to get out of this situation is to restart node A, and sometimes a few other nodes as well. While node A is in this state, any query that uses node A as coordinator has a high risk of failing with errors about not enough replicas being available.

      I have enabled TRACE-level gossip debugging while this happens, and on node A there are multiple messages saying "has already a pending echo, skipping it", i.e. the debug line at Gossiper.java line 882.

      I have also observed, while this was happening, that other nodes were trying to establish connections (SYN packets sent) but the troubled node (A) was not picking them up (no accept()).

      I don't know exactly how the Gossiper works here, but it looks like node A sends out some gossip echo messages, then is too busy to process the replies, and never retries.
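      To illustrate the suspected failure mode, here is a minimal, hypothetical Java sketch (not the actual Cassandra code) of a "pending echo" guard: a flag is set before the probe is sent and further probes to the same endpoint are skipped while the flag is set. If the reply is lost, e.g. dropped during a long GC pause, and nothing ever clears the flag, the endpoint can never be marked alive again. The class and method names are invented for this example.

      ```java
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Hypothetical sketch of a pending-echo guard with no retry/timeout.
      public class PendingEchoSketch {
          // endpoints with an echo probe in flight
          private final Map<String, Boolean> pendingEcho = new ConcurrentHashMap<>();

          /** Returns true if an echo would be sent, false if skipped. */
          public boolean maybeSendEcho(String endpoint) {
              if (pendingEcho.putIfAbsent(endpoint, Boolean.TRUE) != null) {
                  // corresponds to the TRACE line:
                  // "<endpoint> has already a pending echo, skipping it"
                  return false;
              }
              // a real implementation would send an ECHO message here and
              // clear the flag in the reply callback; if the reply never
              // arrives and there is no timeout, the flag sticks forever
              return true;
          }

          /** Echo-reply callback: clears the pending flag. */
          public void onEchoReply(String endpoint) {
              pendingEcho.remove(endpoint);
          }

          public static void main(String[] args) {
              PendingEchoSketch g = new PendingEchoSketch();
              System.out.println(g.maybeSendEcho("10.0.0.2")); // first probe sent
              System.out.println(g.maybeSendEcho("10.0.0.2")); // skipped: still pending
              g.onEchoReply("10.0.0.2");
              System.out.println(g.maybeSendEcho("10.0.0.2")); // sent again after reply
          }
      }
      ```

      Under this reading, once the reply to the first echo is dropped during the GC pause, every subsequent liveness probe to that endpoint is skipped, which would match both the repeated "skipping it" trace lines and the nodes staying marked down indefinitely.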


People

    Assignee: Unassigned
    Reporter: Erik Forsberg (forsberg)