Cassandra / CASSANDRA-9218

Node thinks other nodes are down after heavy GC


Details

    • Type: Bug
    • Status: Resolved
    • Resolution: Duplicate
    • Priority: Normal

    Description

      I have a few troublesome nodes which often end up doing very long GC pauses. The root cause of this is yet to be found, but it's causing another problem: the affected node(s) mark other nodes as down, and they never recover.

      Here's how it goes:

      1. Node goes into troublesome mode, doing heavy GC with long (10+ seconds) GC pauses.
      2. While this happens, node will mark other nodes as down.
      3. Once the overload situation resolves, the node still thinks the other nodes are down (they are not). It's also quite common that other nodes think the affected node is down.

      So we often end up with node A thinking some 30 nodes are down, and then a bunch of other nodes believing node A is down. This is in a cluster of 56 nodes.

      The only way to get out of this situation is to restart node A, and sometimes a few other nodes as well. While node A is in this state, any query that uses node A as coordinator has a high risk of failing with errors about not enough replicas being available.

      I have enabled TRACE-level gossip debugging while this happens, and on node A there are multiple messages saying "has already a pending echo, skipping it", i.e. the debug line at Gossiper.java line 882.

      I have also observed, while this was happening, that other nodes were trying to establish connections (SYN packets sent) but the troubled node (A) was not picking them up (no accept()).

      I don't know exactly how the Gossiper works here, but it looks like node A sends out some gossip echo messages, then is too busy to process the replies, and never retries.
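      To illustrate the suspected failure mode, here is a minimal, hypothetical Java sketch (not the actual Cassandra code) of a "pending echo" guard: a flag is set before the probe is sent and further probes to the same endpoint are skipped while the flag is set. If the reply is lost, e.g. dropped during a long GC pause, and nothing ever clears the flag, the endpoint can never be marked alive again. The class and method names are invented for this example.

      ```java
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Hypothetical sketch of a pending-echo guard with no retry/timeout.
      public class PendingEchoSketch {
          // endpoints with an echo probe in flight
          private final Map<String, Boolean> pendingEcho = new ConcurrentHashMap<>();

          /** Returns true if an echo would be sent, false if skipped. */
          public boolean maybeSendEcho(String endpoint) {
              if (pendingEcho.putIfAbsent(endpoint, Boolean.TRUE) != null) {
                  // corresponds to the TRACE line:
                  // "<endpoint> has already a pending echo, skipping it"
                  return false;
              }
              // a real implementation would send an ECHO message here and
              // clear the flag in the reply callback; if the reply never
              // arrives and there is no timeout, the flag sticks forever
              return true;
          }

          /** Echo-reply callback: clears the pending flag. */
          public void onEchoReply(String endpoint) {
              pendingEcho.remove(endpoint);
          }

          public static void main(String[] args) {
              PendingEchoSketch g = new PendingEchoSketch();
              System.out.println(g.maybeSendEcho("10.0.0.2")); // first probe sent
              System.out.println(g.maybeSendEcho("10.0.0.2")); // skipped: still pending
              g.onEchoReply("10.0.0.2");
              System.out.println(g.maybeSendEcho("10.0.0.2")); // sent again after reply
          }
      }
      ```

      Under this reading, once the reply to the first echo is dropped during the GC pause, every subsequent liveness probe to that endpoint is skipped, which would match both the repeated "skipping it" trace lines and the nodes staying marked down indefinitely.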


People

    Assignee: Unassigned
    Reporter: Erik Forsberg (forsberg)