Details
-
Bug
-
Status: Open
-
Normal
-
Resolution: Unresolved
-
None
-
None
-
Normal
Description
I'm seeing semi-frequent instances of nodetool compactionstats hanging forever. While this is occurring none of the compaction metrics are available via jmx/jconsole.
Sometimes the node will eventually recover, but I'm seeing a pattern where a node will exhibit this behavior, and then eventually pending mutations start piling up and the node dies due to OOM. Sometimes pending gossip operations starting piling up too, but I think this is due to the impending OOM causing everything to bog down.
As an experiment I turned auto compaction off on all the nodes and I haven't seen this issue occur since I did that. Additionally, I'm running relocatesstables on some nodes with unthrottled compaction and so far none of them have had any issues handling it.
I managed to get some stack traces from a dying node:
All MutationStage threads look similar to this:
Name: MutationStage-10 State: WAITING Total blocked: 9 Total waited: 1,959,850 Stack trace: sun.misc.Unsafe.park(Native Method) java.util.concurrent.locks.LockSupport.park(Unknown Source) org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUninterruptibly(WaitQueue.java:279) org.apache.cassandra.utils.memory.MemtableAllocator$SubAllocator.allocate(MemtableAllocator.java:162) org.apache.cassandra.utils.memory.SlabAllocator.allocate(SlabAllocator.java:89) org.apache.cassandra.utils.memory.ContextAllocator.allocate(ContextAllocator.java:57) org.apache.cassandra.utils.memory.ContextAllocator.clone(ContextAllocator.java:47) org.apache.cassandra.db.rows.BufferCell.copy(BufferCell.java:122) org.apache.cassandra.utils.memory.AbstractAllocator$CloningBTreeRowBuilder.addCell(AbstractAllocator.java:72) org.apache.cassandra.db.rows.Rows.copy(Rows.java:51) org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:332) org.apache.cassandra.db.partitions.AtomicBTreePartition$RowUpdater.apply(AtomicBTreePartition.java:295) org.apache.cassandra.utils.btree.NodeBuilder.addNewKey(NodeBuilder.java:323) org.apache.cassandra.utils.btree.NodeBuilder.update(NodeBuilder.java:184) org.apache.cassandra.utils.btree.TreeBuilder.update(TreeBuilder.java:95) org.apache.cassandra.utils.btree.BTree.update(BTree.java:182) org.apache.cassandra.db.partitions.AtomicBTreePartition.addAllWithSizeDelta(AtomicBTreePartition.java:156) org.apache.cassandra.db.Memtable.put(Memtable.java:284)org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:1316) org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:618) org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:425) org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:222) org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:68) org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) java.lang.Thread.run(Unknown Source)
The Compaction threads
Name: CompactionExecutor:4 State: RUNNABLE Total blocked: 32,781,277 Total waited: 549 Stack trace: org.apache.cassandra.io.sstable.format.SSTableReader.getTotalBytes(SSTableReader.java:661) org.apache.cassandra.db.compaction.LeveledManifest.getCandidatesFor(LeveledManifest.java:669) org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:385) - locked org.apache.cassandra.db.compaction.LeveledManifest@55c79600 org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:119) org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(CompactionStrategyManager.java:119) org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:261) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) java.util.concurrent.FutureTask.run(Unknown Source) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) java.util.concurrent.FutureTask.run(Unknown Source) java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/45023307.run(Unknown Source) java.lang.Thread.run(Unknown Source)
Name: CompactionExecutor:1 State: BLOCKED on org.apache.cassandra.db.compaction.LeveledManifest@55c79600 owned by: CompactionExecutor:2 Total blocked: 116,196,349 Total waited: 4,771 Stack trace: org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:310) org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:119) org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(CompactionStrategyManager.java:119) org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:261) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) java.util.concurrent.FutureTask.run(Unknown Source) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) java.util.concurrent.FutureTask.run(Unknown Source) java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/45023307.run(Unknown Source) java.lang.Thread.run(Unknown Source)
Name: CompactionExecutor:2 State: BLOCKED on org.apache.cassandra.db.compaction.LeveledManifest@55c79600 owned by: CompactionExecutor:1 Total blocked: 26,542,909 Total waited: 4,373 Stack trace: org.apache.cassandra.db.compaction.LeveledManifest.getCompactionCandidates(LeveledManifest.java:310) org.apache.cassandra.db.compaction.LeveledCompactionStrategy.getNextBackgroundTask(LeveledCompactionStrategy.java:119) org.apache.cassandra.db.compaction.CompactionStrategyManager.getNextBackgroundTask(CompactionStrategyManager.java:119) org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:261) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) java.util.concurrent.FutureTask.run(Unknown Source) java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) java.util.concurrent.FutureTask.run(Unknown Source) java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79) org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$4/45023307.run(Unknown Source) java.lang.Thread.run(Unknown Source)
Tp stats from the node
Pool Name Active Pending Completed Blocked All time blocked MutationStage 32 762545 53122841 0 0 ViewMutationStage 0 0 0 0 0 ReadStage 0 0 247792 0 0 RequestResponseStage 0 0 200621 0 0 ReadRepairStage 0 0 2489 0 0 CounterMutationStage 0 0 0 0 0 MiscStage 0 0 0 0 0 CompactionExecutor 3 58 26816 0 0 MemtableReclaimMemory 0 0 176 0 0 PendingRangeCalculator 0 0 84 0 0 GossipStage 0 0 235500 0 0 SecondaryIndexManagement 0 0 0 0 0 HintsDispatcher 0 0 749 0 0 PerDiskMemtableFlushWriter_1 0 0 156 0 0 PerDiskMemtableFlushWriter_2 0 0 156 0 0 MigrationStage 1 25 73 0 0 MemtablePostFlush 1 25 1953 0 0 PerDiskMemtableFlushWriter_0 0 0 176 0 0 ValidationExecutor 0 0 1320 0 0 Sampler 0 0 0 0 0 MemtableFlushWriter 1 18 176 0 0 InternalResponseStage 0 0 3030 0 0 AntiEntropyStage 0 0 4087 0 0 CacheCleanupExecutor 0 0 0 0 0 Native-Transport-Requests 0 0 670925 0 0