[CASSANDRA-15902] OOM because repair session thread not closed when terminating repair - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 3.0.23, 3.11.9, 4.0-beta3, 4.0
Component/s: Consistency/Repair
Labels: None
Bug Category: Degradation - Resource Management
Severity: Normal
Complexity: Normal
Discovered By: User Report
Platform: All
Impacts: None
Since Version: 3.0.0
Source Control Link: https://github.com/apache/cassandra/commit/45ad38fb5aec76418589c07d88fd0ca27fb430f4
Test and Documentation Plan:
- Add unit test exposing the issue
- For trunk, add only regression test as unit test

Description

In our cluster, some nodes slowly run out of memory over time. On those nodes we observed that Cassandra Reaper terminates repairs with a JMX call to StorageServiceMBean.forceTerminateAllRepairSessions() after reaching its timeout of 30 min.

In the heap dump we see a lot of instances of io.netty.util.concurrent.FastThreadLocalThread occupying most of the memory:

119 instances of "io.netty.util.concurrent.FastThreadLocalThread", loaded by "sun.misc.Launcher$AppClassLoader @ 0x51a800000" occupy 8.445.684.480 (93,96 %) bytes.

In the thread dump we see a lot of repair threads:

grep "Repair#" threaddump.txt | wc -l
      50

The repair jobs are waiting for the validation to finish:

"Repair#152:1" #96170 daemon prio=5 os_prio=0 tid=0x0000000012fc5000 nid=0x542a waiting on condition [0x00007f81ee414000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007939bcfc8> (a com.google.common.util.concurrent.AbstractFuture$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:285)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
        at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
        at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1509)
        at org.apache.cassandra.repair.RepairJob.run(RepairJob.java:160)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$13/480490520.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)

This is the line where the threads are stuck:

// Wait for validation to complete
Futures.getUnchecked(validations);

The call to StorageServiceMBean.forceTerminateAllRepairSessions() stops the thread pool executor. It looks like futures that are still in progress are therefore never completed, so the repair thread waits forever and never finishes.
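
The suspected mechanism can be reproduced in isolation. The following is a minimal sketch using only JDK classes, not Cassandra's actual code: the class name OrphanedFutureSketch, the method waiterStaysBlocked and the "repair terminated" message are made up for illustration, and CompletableFuture stands in for the Guava future seen in the stack trace. It shows that a task dropped by shutdownNow() never completes its future, so a waiter using an untimed get() would park forever unless the pending future is failed explicitly.

```java
import java.util.List;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration; names do not come from the Cassandra code base.
final class OrphanedFutureSketch {

    /** Returns true if a waiter on the future is still blocked after shutdownNow(). */
    static boolean waiterStaysBlocked(boolean failPendingOnShutdown) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CompletableFuture<String> validation = new CompletableFuture<>();
        CountDownLatch workerBusy = new CountDownLatch(1);

        // First task occupies the only worker thread until it is interrupted.
        executor.execute(() -> {
            workerBusy.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
                // interrupted by shutdownNow(); just exit
            }
        });
        // Second task waits in the queue; it is the only code that completes the future.
        executor.execute(() -> validation.complete("done"));

        try {
            workerBusy.await();
            // Stops the pool and returns the queued tasks that will never run.
            List<Runnable> dropped = executor.shutdownNow();
            if (failPendingOnShutdown && !dropped.isEmpty()) {
                // Possible remedy: fail the pending future so waiters are released.
                validation.completeExceptionally(new CancellationException("repair terminated"));
            }
            // Timed get() so the sketch terminates; an untimed get() would park forever.
            validation.get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;  // future was never completed
        } catch (ExecutionException | CancellationException e) {
            return false; // waiter was released with an error
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, waiterStaysBlocked(false) models the behaviour described above (the waiter times out because nothing ever completes the future), while waiterStaysBlocked(true) models one possible remedy, where termination fails the pending future and the waiter returns immediately.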

Environment:

Cassandra version: 3.11.4 and 3.11.6

Cassandra Reaper: 1.4.0

JVM memory settings:

-Xms11771M -Xmx11771M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M

On another cluster with the same issue:

-Xms31744M -Xmx31744M -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:+ParallelRefProcEnabled -XX:MaxMetaspaceSize=100M

Java Runtime:

openjdk version "1.8.0_212"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_212-b03)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.212-b03, mixed mode)

The same issue is described in this comment: https://issues.apache.org/jira/browse/CASSANDRA-14355?focusedCommentId=16992973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16992973

As suggested in the comments there, I created this new, specific ticket.

Attachments

repair-terminated.txt (9 kB), added 25/Jun/20 14:29 by Swen Fuhrmann
heap-mem-histo.txt (316 kB), added 25/Jun/20 15:54 by Swen Fuhrmann

Issue Links

is broken by

CASSANDRA-14332 Fix unbounded validation compactions on repair

Resolved

is fixed by

CASSANDRA-13797 RepairJob blocks on syncTasks

Resolved

is related to

CASSANDRA-13555 Thread leak during repair

Resolved

People

Assignee: Swen Fuhrmann

Reporter: Swen Fuhrmann

Authors: Swen Fuhrmann

Reviewers: Alexander Dejanovski, Brandon Williams

Votes: 0

Watchers: 6

Dates

Created: 25/Jun/20 13:50

Updated: 03/Nov/20 07:48

Resolved: 28/Oct/20 14:01