Fix Version/s: None
Bug Category: Degradation - Resource Management
Discovered By: User Report
(We originally reported this at https://github.com/thelastpickle/cassandra-reaper/issues/898, since the behavior can be triggered by Reaper. I will copy-paste the report here, rephrased slightly.)
We have a fairly big table (240 GB per node) whose Reaper repairs kept failing: they get killed by Reaper's handlePotentialStuckRepairs, which calls ActiveRepairService#terminateSessions.
On this cluster (running G1GC), we also observed a memory leak: the old gen keeps growing until the JVM is forced into minutes-long full GC pauses, which still cannot reclaim much of the old gen.
From a heap dump, we eventually traced the memory leak to dozens of RepairJob threads, each holding on to hundreds of megabytes of MerkleTrees objects.
The threads look like this in the jmap output (Cassandra 3.11.4):
After checking the code, we think this is what happens:
1. reaper schedules repair #1 to node A
2. node A requests merkle trees from neighboring node B and C
3. node B finishes validation phase, sends merkle tree to node A
4. node C finishes validation phase, sends merkle tree to node A
5. reaper schedules repair #2, calls `handlePotentialStuckRepairs`
6. node A finishes validation phase
7. node A starts sync phase
8. repair #1 on nodes A, B, and C is stuck indefinitely: the executor was already shut down by `handlePotentialStuckRepairs`, so nobody ever picks up the sync task, and the RepairJob threads keep their MerkleTrees reachable forever
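The core hazard in the steps above can be sketched with a minimal, self-contained Java example (all names here are hypothetical stand-ins, not the actual Cassandra classes): once an executor has been shut down, a later-submitted continuation (the "sync task") never runs, so the future it was supposed to complete stays pending and everything referenced from that pending chain, such as the merkle trees, remains reachable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

public class StuckRepairSketch {
    public static void main(String[] args) {
        // Stand-in for the repair task executor on node A.
        ExecutorService taskExecutor = Executors.newSingleThreadExecutor();

        // Stand-in for the large MerkleTrees payload held by the repair job.
        final byte[] merkleTrees = new byte[1024 * 1024];

        // Future representing the sync phase; completing it would let
        // the repair job finish and release its references.
        CompletableFuture<Void> syncPhase = new CompletableFuture<>();

        // Simulates terminateSessions() shutting the executor down
        // between the validation phase and the sync phase (step 5).
        taskExecutor.shutdownNow();

        try {
            // Steps 6-7: node A now tries to schedule the sync task...
            taskExecutor.submit(() -> {
                // ...but this body never runs, so syncPhase never completes
                // and merkleTrees stays strongly reachable.
                syncPhase.complete(null);
            });
        } catch (RejectedExecutionException e) {
            // A plain ThreadPoolExecutor rejects loudly; in the real code
            // path nothing unblocks the waiting repair session either way.
            System.out.println("sync task rejected: " + e.getClass().getSimpleName());
        }

        System.out.println("sync phase done? " + syncPhase.isDone());
        System.out.println("merkle trees still referenced: " + (merkleTrees.length > 0));
    }
}
```

Note that a plain `ThreadPoolExecutor` at least throws `RejectedExecutionException` here; the point of the sketch is only that after shutdown nothing ever completes the sync-phase future, which matches step 8 above, where repair #1 hangs indefinitely on all three nodes.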