CASSANDRA-13923

Flushers blocked due to many SSTables

Details

    • Type: Bug
    • Status: Resolved
    • Normal
    • Resolution: Duplicate
    • None
    • None
    • Environment: Cassandra 3.11.0
      CentOS 6 (downgraded JNA)
      64GB RAM
      12-disk JBOD

    • Normal

    Description

      This started on the mailing list, and I'm not 100% sure of the root cause, so feel free to re-title if needed.

      I just upgraded Cassandra from 2.2.6 to 3.11.0. Within a few hours of serving traffic, thread pools begin to back up and their pending tasks grow indefinitely. This affects multiple stages (Read, Mutation), and pending tasks consistently build up for MemtablePostFlush and MemtableFlushWriter.

      jstack shows blocking when threads try to call getCompactionCandidates, which seems to happen on flush. Our nodes are fairly large, with ~15,000 SSTables each, all LCS.
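
      For anyone triaging a similar dump, a minimal sketch of how a jstack output could be summarized by thread pool and thread state. It assumes the usual HotSpot jstack layout (a quoted thread name on the header line, followed by a java.lang.Thread.State line). The sample text, the helper names pool_of and summarize, and the rule for stripping numeric suffixes are all illustrative assumptions; the sample is hand-written and is not taken from the attached files.

```python
import re
from collections import Counter

# Hand-written sample in the usual jstack layout. This is an assumption for
# illustration only and is NOT an excerpt from the attached dumps.
SAMPLE_DUMP = '''\
"MemtableFlushWriter:1" #101 daemon prio=5 tid=0x1 nid=0x1 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"ReadStage-1" #102 daemon prio=5 tid=0x2 nid=0x2 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"ReadStage-2" #103 daemon prio=5 tid=0x3 nid=0x3 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"CompactionExecutor:4" #104 daemon prio=1 tid=0x4 nid=0x4 runnable
   java.lang.Thread.State: RUNNABLE
'''

# A thread header starts with the quoted thread name.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# The state line is indented and names one of the java.lang.Thread.State values.
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def pool_of(thread_name):
    """Drop a trailing ':N' or '-N' suffix so threads group by pool (assumed naming)."""
    return re.sub(r'[:\-]\d+$', '', thread_name)


def summarize(dump_text):
    """Count threads per (pool, state) pair in a jstack-style dump."""
    counts = Counter()
    current_pool = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current_pool = pool_of(header.group('name'))
            continue
        state = THREAD_STATE.match(line)
        if state and current_pool is not None:
            counts[(current_pool, state.group('state'))] += 1
            current_pool = None  # only the first state line belongs to this thread
    return counts


if __name__ == "__main__":
    for (pool, state), n in sorted(summarize(SAMPLE_DUMP).items()):
        print(f"{pool:24s} {state:14s} {n}")
```

      To run it against a real dump, read the file contents and pass them to summarize in place of SAMPLE_DUMP. A large count of BLOCKED threads in one pool would then point at the stack frames worth reading by hand.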

      It seems like this can also block reads, because they try to acquire a read lock when calling shouldDefragment.

      And writes, of course, block once we can't allocate any more memtables, because flushes are backed up.

      We did not have this problem in 2.2.6, so there seems to be a regression that makes calls which list out the SSTables, such as getCompactionCandidates, incredibly slow.

      In our case this causes nodes to build up pending tasks and simply stop responding to requests.

      Attachments

        1. cassandra-jstack-readstage.txt
          241 kB
          Dan Kinder
        2. cassandra-jstack.txt
          396 kB
          Dan Kinder

            People

              marcuse Marcus Eriksson
              dkinder Dan Kinder
              Marcus Eriksson