[CASSANDRA-12689] All MutationStage threads blocked, kills server - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Urgent
Resolution: Fixed
Fix Version/s: 3.0.10, 3.10
Component/s: Feature/Materialized Views, Legacy/Local Write-Read Paths
Labels: None
Severity: Critical

Description

Under heavy load (e.g. due to repair during normal operations), many NullPointerExceptions occur in MutationStage. Unfortunately, the log is not very informative and the stack trace is missing:

2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
2016-09-22T06:29:47+00:00 cas6     java.lang.NullPointerException: null

Then, after some time, in most cases ALL threads in the MutationStage pool are completely blocked. Pending tasks then pile up until the server runs out of memory and becomes completely unresponsive due to GC. The threads NEVER unblock until the server is restarted, even if load drops away completely, all hints are paused, and no compaction or repair is running. Only a restart helps.

I can understand that pending tasks in MutationStage may pile up under heavy load, but tasks should be processed and dequeued after load goes down. This is definitely not the case. This looks more like an unhandled exception leading to a stuck lock.
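The suspected failure mode can be illustrated outside Cassandra. The following is a minimal sketch, not Cassandra code: `StuckFutureSketch`, `deferredWrite` and `waiterTimesOut` are hypothetical names, and the simulated NullPointerException only stands in for whatever fails in the real write path. It shows that if the code responsible for completing a CompletableFuture throws first, a caller waiting on that future never wakes up. The sketch uses a timed `get` so that it terminates; an untimed wait would block indefinitely.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class StuckFutureSketch {

    // Hypothetical stand-in for a write that is handed off and completed later.
    // If failBeforeCompleting is true, the task throws before it reaches
    // future.complete(...), so the returned future is never completed.
    public static CompletableFuture<String> deferredWrite(ExecutorService pool,
                                                          boolean failBeforeCompleting) {
        CompletableFuture<String> future = new CompletableFuture<>();
        // submit() captures the exception in a Future that nobody inspects,
        // so the failure leaves no trace for the waiter.
        pool.submit(() -> {
            if (failBeforeCompleting) {
                throw new NullPointerException("simulated failure");
            }
            future.complete("applied");
        });
        return future;
    }

    // Returns true if the waiter was still blocked when the timeout expired.
    public static boolean waiterTimesOut(boolean failBeforeCompleting) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> future = deferredWrite(pool, failBeforeCompleting);
            future.get(500, TimeUnit.MILLISECONDS); // timed wait keeps the demo finite
            return false;
        } catch (TimeoutException e) {
            return true;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy path still blocked: " + waiterTimesOut(false));
        System.out.println("failed path still blocked:  " + waiterTimesOut(true));
    }
}
```

In this sketch the usual remedy would be to complete the future exceptionally (`future.completeExceptionally(e)`) in a catch block, so that waiters are released with an error instead of waiting forever. Whether that matches the actual fix for this issue is not stated here.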

Stack trace from jconsole; all threads in MutationStage show the same trace.

Name: MutationStage-48
State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
Total blocked: 137  Total waited: 138.513

Stack trace:

sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
org.apache.cassandra.hints.Hint.apply(Hint.java:96)
org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
java.lang.Thread.run(Thread.java:745)
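The same check can be scripted instead of being read from jconsole. The sketch below is an assumption-laden illustration using the standard `java.lang.management` API: `StageThreadStates`, `countThreads`, `demo` and the `DemoStage-` thread-name prefix are hypothetical. It only inspects the JVM it runs in; checking a live Cassandra node would additionally need a remote JMX connection, which is not shown.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

final class StageThreadStates {

    // Counts live threads in this JVM whose name starts with namePrefix
    // and whose current state equals the given state.
    public static int countThreads(String namePrefix, Thread.State state) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            // Entries are null for threads that exited between the two calls.
            if (info != null
                    && info.getThreadName().startsWith(namePrefix)
                    && info.getThreadState() == state) {
                count++;
            }
        }
        return count;
    }

    // Demo: parks one thread with a hypothetical stage-like name and counts it.
    public static int demo() {
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                release.await(); // parks the thread, so its state becomes WAITING
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "DemoStage-1");
        worker.setDaemon(true);
        worker.start();
        while (worker.getState() != Thread.State.WAITING) {
            Thread.yield(); // spin until the worker has parked
        }
        int waiting = countThreads("DemoStage-", Thread.State.WAITING);
        release.countDown(); // let the worker finish
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return waiting;
    }

    public static void main(String[] args) {
        System.out.println("waiting DemoStage threads: " + demo());
    }
}
```

Applied to the situation described here, a count of WAITING threads equal to the pool size, staying constant while pending tasks grow, would match the reported symptom.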

Issue Links

is related to

CASSANDRA-12905 Retry acquire MV lock on failure instead of throwing WTE on streaming

Resolved

People

Assignee: Benjamin Roth

Reporter: Benjamin Roth

Authors: Benjamin Roth

Reviewers: Tom Hobbs

Votes: 0

Watchers: 11

Dates

Created: 22/Sep/16 16:59

Updated: 16/Apr/19 09:30

Resolved: 28/Oct/16 20:48