CASSANDRA-13095

Timeouts between nodes


Details

    • Low

    Description

      Recently I've run into a problem on a heavily loaded cluster where messages between certain nodes sometimes become blocked for no apparent reason.

      It looks like the same situation as the one described here: https://issues.apache.org/jira/browse/CASSANDRA-12676?focusedCommentId=15736166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736166

      A thread dump showed an infinite loop here: https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L109

      Apparently the problem is in the initial value of the epoch field in the TimeHorizonMovingAverageCoalescingStrategy class. When its value is not evenly divisible by BUCKET_INTERVAL, ix(epoch-1) does not point to the correct bucket. As a result, sum gradually increases and, upon reaching MEASURED_INTERVAL, averageGap becomes 0 and the thread blocks.
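
      To illustrate the misalignment, here is a minimal standalone sketch. It is not the Cassandra code: the constant values, the bit-shift form of ix, the alignToBucket helper and the integer-division form of averageGap are all assumptions made for the example, so check them against the linked CoalescingStrategies.java. It shows that for an unaligned epoch, ix(epoch-1) lands in the same bucket as ix(epoch), while for an epoch rounded down to a bucket boundary it lands in the preceding bucket, and that an integer-division averageGap yields 0 once sum grows past MEASURED_INTERVAL.

```java
class EpochAlignmentSketch {
    // Assumed values, for illustration only; the real ones are in CoalescingStrategies.java.
    static final int INDEX_SHIFT = 26;
    static final long BUCKET_INTERVAL = 1L << INDEX_SHIFT;
    static final int BUCKET_COUNT = 16;
    static final long MEASURED_INTERVAL = BUCKET_INTERVAL * (BUCKET_COUNT - 1);

    // Maps a nanosecond timestamp to a bucket index in [0, BUCKET_COUNT).
    static int ix(long nanos) {
        return (int) ((nanos >>> INDEX_SHIFT) & (BUCKET_COUNT - 1));
    }

    // Hypothetical helper: rounds a timestamp down to the start of its bucket.
    static long alignToBucket(long nanos) {
        return nanos & ~(BUCKET_INTERVAL - 1);
    }

    // Assumed shape of the average: integer division, so a large sum gives 0.
    static long averageGap(long sum) {
        if (sum == 0)
            return Integer.MAX_VALUE;
        return MEASURED_INTERVAL / sum;
    }

    public static void main(String[] args) {
        long unaligned = 5 * BUCKET_INTERVAL + 12345; // not a multiple of BUCKET_INTERVAL
        long aligned = alignToBucket(unaligned);

        // Unaligned: epoch and epoch - 1 fall into the same bucket.
        System.out.println(ix(unaligned) + " " + ix(unaligned - 1));
        // Aligned: epoch - 1 falls into the bucket just before epoch's bucket.
        System.out.println(ix(aligned) + " " + ix(aligned - 1));
        // Once sum exceeds MEASURED_INTERVAL, the integer division returns 0.
        System.out.println(averageGap(MEASURED_INTERVAL + 1));
    }
}
```

      Under these assumptions, rounding the initial epoch down to a bucket boundary is enough to make ix(epoch-1) refer to the preceding bucket again.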

      It's hard to reproduce because it takes a long time for sum to grow, and when no messages are sent for some time, sum is reset to 0 (https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L301) and the bug is no longer reproducible (until the connection between the nodes is re-created).

      I've added a patch which should fix the problem. I don't know if it will be of any help, since CASSANDRA-12676 will apparently disable this behaviour. One note about the performance regressions, though: there is a small chance that they are a result of the bug described here, so it might be worth testing performance after the fix and/or tuning the algorithm.

      Attachments

        1. 13095-2.1.patch
          3 kB
          Danil Smirnov

        Activity

          People

            d.smirnov Danil Smirnov
            d.smirnov Danil Smirnov
            Danil Smirnov
            Votes: 1
            Watchers: 6
