Uploaded image for project: 'Apache Cassandra'
  1. Apache Cassandra
  2. CASSANDRA-13810

Overload because of hint pressure + MVs

    XMLWordPrintableJSON

Details

    • Availability - Unavailable
    • Critical

    Description

      Cluster setup: 3 DCs, 20 Cassandra nodes each, all 3.0.14, with approx. 200GB data per machine. Many tables have MVs associated.

      During some maintenance we did a rolling restart of all nodes in the cluster. This caused a buildup of hints/batches, as expected. Most nodes came back just fine, except for two nodes.

      These two nodes came back with a loadavg of >100, and 'nodetool tpstats' showed a million (not exaggerating) MutationStage tasks per second. It was clear that these were mostly (all?) mutations coming from hints, as indicated by thousands of log entries per second in debug.log :

      DEBUG [SharedPool-Worker-107] 2017-08-27 13:16:51,098 HintVerbHandler.java:95 - Failed to apply hint
      java.util.concurrent.CompletionException: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
          at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[na:1.8.0_144]
          at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[na:1.8.0_144]
          at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:647) ~[na:1.8.0_144]
          at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632) ~[na:1.8.0_144]
          at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[na:1.8.0_144]
          at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[na:1.8.0_144]
          at org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:481) ~[apache-cassandra-3.0.14.jar:3.0.14]
          at org.apache.cassandra.db.Keyspace.lambda$applyInternal$0(Keyspace.java:495) ~[apache-cassandra-3.0.14.jar:3.0.14]
          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_144]
          at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:164) ~[apache-cassandra-3.0.14.jar:3.0.14]
          at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) ~[apache-cassandra-3.0.14.jar:3.0.14]
          at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_144]
      Caused by: org.apache.cassandra.exceptions.WriteTimeoutException: Operation timed out - received only 0 responses.
          ... 6 common frames omitted
      

      After reading the relevant code, it seems that a hint is considered droppable, and in the mutation path when the table contains a MV and the lock fails to acquire and the mutation is droppable, it throws a WTE without waiting until the timeout expires. This explains why Cassandra is able to process a million mutations per second without actually considering them 'dropped' in the 'nodetool tpstats' output.

      I managed to recover the two nodes by stopping handoffs on all nodes in the cluster and reenabling them one at a time. It's likely that the hint/batchlog settings were sub-optimal on this cluster, but I think that the retry behavior of hints should be improved as it's hard to express hint throughput in kb/s when the mutations can involve MVs.

      More data available upon request – I'm not sure which bits are relevant and which aren't.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              tvdw Tom van der Woerdt
              Votes:
              2 Vote for this issue
              Watchers:
              14 Start watching this issue

              Dates

                Created:
                Updated: