CASSANDRA-15902

OOM because repair session thread not closed when terminating repair

Details

    Description

      In our cluster, some nodes slowly run out of memory over time. On those nodes we observed that Cassandra Reaper terminates repairs with a JMX call to StorageServiceMBean.forceTerminateAllRepairSessions() when they reach the timeout of 30 minutes.

      In the heap dump we see that a large number of io.netty.util.concurrent.FastThreadLocalThread instances occupy most of the memory:

      119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by "sun.misc.Launcher$AppClassLoader @ 0x51a800000" occupy 8.445.684.480 (93,96 %) bytes. 

      In the thread dump we see many repair threads:

      grep "Repair#" threaddump.txt | wc -l
            50 
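
      To see how these threads are spread across repair pools, the same thread dump can be tallied per pool instead of only counted. The following is a minimal sketch, not part of Cassandra: the class name RepairThreadTally is hypothetical, and it assumes thread header lines look like "Repair#152:1" as in the dump shown in this ticket, with the part before the colon identifying one pool.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: groups repair threads found in a thread dump by pool name.
class RepairThreadTally {

    // Matches thread header lines such as: "Repair#152:1" #96170 daemon prio=5 ...
    // Assumption: the text before the colon names one repair pool.
    private static final Pattern HEADER = Pattern.compile("^\"(Repair#\\d+):\\d+\"");

    static Map<String, Integer> tally(List<String> dumpLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : dumpLines) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RepairThreadTally threaddump.txt
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        tally(lines).forEach((pool, n) -> System.out.println(pool + " " + n));
    }
}
```

      Stack frame lines and threads from other pools do not match the header pattern and are ignored.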

       

      The repair jobs are waiting for the validation to finish:

      "Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x0000000012fc5000 nid=0x542a waiting on condition [0x00007f81ee414000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000007939bcfc8> (a com.google.common.util.concurrent.AbstractFuture$Sync)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
              at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
              at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
              at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
              at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
              at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
              at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown Source)
              at java.lang.Thread.run(Thread.java:748) 

       

      This is the line where the threads are stuck:

      // Wait for validation to complete
      Futures.getUnchecked(validations); 

       

      The call to StorageServiceMBean.forceTerminateAllRepairSessions() stops the thread pool executor. It looks like futures that are still in progress are therefore never completed, so the repair thread waits forever and never finishes.
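
      The suspected mechanism can be reproduced outside Cassandra with plain JDK classes. The sketch below is an illustration under assumptions, not the Cassandra code: the class StuckRepairSketch and its methods are hypothetical, the local getUninterruptibly only imitates the uninterruptible wait seen in the stack trace, and terminating the repair is assumed to amount to calling shutdownNow() on the executor that runs the job.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical reproduction of a job that survives executor shutdown.
class StuckRepairSketch {

    // Waits for the future and retries on interrupt, restoring the interrupt
    // flag only once the future has completed.
    static <T> T getUninterruptibly(Future<T> future) throws ExecutionException {
        boolean interrupted = false;
        try {
            while (true) {
                try {
                    return future.get();
                } catch (InterruptedException e) {
                    interrupted = true; // swallow the interrupt and keep waiting
                }
            }
        } finally {
            if (interrupted) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Returns true if the worker is still blocked after shutdownNow().
    static boolean jobStillBlockedAfterShutdown() {
        ExecutorService jobExecutor = Executors.newSingleThreadExecutor();
        // Stands in for the validation futures; nothing ever completes it.
        CompletableFuture<Void> validations = new CompletableFuture<>();
        CountDownLatch started = new CountDownLatch(1);

        jobExecutor.execute(() -> {
            started.countDown();
            try {
                getUninterruptibly(validations);
            } catch (ExecutionException e) {
                // not reached in this sketch
            }
        });

        try {
            started.await();
            jobExecutor.shutdownNow(); // interrupts the worker thread
            boolean terminated = jobExecutor.awaitTermination(500, TimeUnit.MILLISECONDS);
            return !terminated;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            validations.complete(null); // cleanup so the demo thread can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("still blocked after shutdownNow: " + jobStillBlockedAfterShutdown());
    }
}
```

      In the sketch the interrupt sent by shutdownNow() does not release the worker, because the wait is retried until the future completes. The worker, and everything it references, stays alive until something completes or fails the future.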

       

      Environment:

      Cassandra version: 3.11.4 and 3.11.6

      Cassandra Reaper: 1.4.0

      JVM memory settings:

      -Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M 

      On another cluster with the same issue:

      -Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M 

      Java Runtime:

      openjdk version "1.8.0_212"
      OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
      OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode) 

       

      The same issue is described in this comment: https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

      As suggested in the comments, I created this new, specific ticket.

      Attachments

        1. repair-terminated.txt
          9 kB
          Swen Fuhrmann
        2. heap-mem-histo.txt
          316 kB
          Swen Fuhrmann

            People

              moczarski Swen Fuhrmann
              Swen Fuhrmann
              Alexander Dejanovski, Brandon Williams