  1. Cassandra
  2. CASSANDRA-12689

All MutationStage threads blocked, kills server


Details

    • Critical

    Description

      Under heavy load (e.g. when a repair runs during normal operations), many NullPointerExceptions occur in MutationStage. Unfortunately, the log is terse and the stack trace is missing:

      2016-09-22T06:29:47+00:00 cas6 [MutationStage-1] org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService Uncaught exception on thread Thread[MutationStage-1,5,main]: {}
      2016-09-22T06:29:47+00:00 cas6 #011java.lang.NullPointerException: null
      

      Then, after some time, in most cases ALL threads in the MutationStage pool are completely blocked. Pending tasks pile up until the server runs out of memory and becomes completely unresponsive due to GC. The threads NEVER unblock until the server is restarted, even if load drops to zero, all hints are paused, and no compaction or repair is running. Only a restart helps.

      I can understand that pending tasks in MutationStage may pile up under heavy load, but they should be processed and dequeued once the load goes down. This is definitely not the case. It looks more like an unhandled exception leading to a stuck lock.
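
To make the suspected failure mode concrete, here is a minimal Java sketch of how an unhandled exception can leave waiters parked forever on a CompletableFuture. It is not Cassandra code: the class StuckFutureSketch and the methods doWrite, applyFragile, applySafe and outcome are hypothetical names invented for illustration, and whether this is what actually happens inside Mutation.apply is an assumption, not something the report confirms.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class StuckFutureSketch {

    // Hypothetical stand-in for a write that may fail partway through.
    static void doWrite(boolean fail) {
        if (fail) {
            throw new NullPointerException();
        }
    }

    // Fragile variant: if doWrite throws, the future is never completed,
    // so any thread blocked in get() on it waits forever.
    static CompletableFuture<Void> applyFragile(ExecutorService pool, boolean fail) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        pool.execute(() -> {
            doWrite(fail);
            done.complete(null);
        });
        return done;
    }

    // Safer variant: the failure is handed to the future, so waiters wake up
    // with an ExecutionException instead of staying parked.
    static CompletableFuture<Void> applySafe(ExecutorService pool, boolean fail) {
        CompletableFuture<Void> done = new CompletableFuture<>();
        pool.execute(() -> {
            try {
                doWrite(fail);
                done.complete(null);
            } catch (Throwable t) {
                done.completeExceptionally(t);
            }
        });
        return done;
    }

    // Waits with a timeout so the demo itself cannot hang.
    static String outcome(CompletableFuture<Void> future, long timeoutMillis) {
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return "completed";
        } catch (ExecutionException e) {
            return "failed";
        } catch (TimeoutException e) {
            return "stuck";
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        System.out.println("fragile: " + outcome(applyFragile(pool, true), 500));
        System.out.println("safe:    " + outcome(applySafe(pool, true), 500));
        pool.shutdown();
    }
}
```

In the fragile variant the exception escapes on the worker thread and the waiter only returns because of the timeout. A caller that waits without a timeout, as the trace below does through Uninterruptibles.getUninterruptibly, would stay in WAITING indefinitely under the same conditions.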

      Stack trace from jconsole; all threads in MutationStage show the same trace.

      Name: MutationStage-48
      State: WAITING on java.util.concurrent.CompletableFuture$Signaller@fcc8266
      Total blocked: 137  Total waited: 138.513
      

      Stack trace:

      sun.misc.Unsafe.park(Native Method)
      java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
      java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
      java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
      java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
      com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
      org.apache.cassandra.db.Mutation.apply(Mutation.java:227)
      org.apache.cassandra.db.Mutation.apply(Mutation.java:241)
      org.apache.cassandra.hints.Hint.apply(Hint.java:96)
      org.apache.cassandra.hints.HintVerbHandler.doVerb(HintVerbHandler.java:91)
      org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162)
      org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134)
      org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109)
      java.lang.Thread.run(Thread.java:745)
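
The same check that was done by hand in jconsole can be sketched with the standard java.lang.management API. The class WaitingThreadReport and its method waitingThreads are hypothetical names; the sketch inspects the JVM it runs in, so using it against a Cassandra node would require running it inside that process or adapting it to a remote JMX connection, which is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class WaitingThreadReport {

    // Lists live threads of the current JVM whose name starts with namePrefix
    // and whose state is WAITING, together with the object they wait on.
    static List<String> waitingThreads(String namePrefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> result = new ArrayList<>();
        // false, false: skip locked monitors and ownable synchronizers.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info == null) {
                continue;
            }
            if (info.getThreadName().startsWith(namePrefix)
                    && info.getThreadState() == Thread.State.WAITING) {
                result.add(info.getThreadName() + " on " + info.getLockName());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "MutationStage-" is the thread name prefix seen in the report.
        for (String line : waitingThreads("MutationStage-")) {
            System.out.println(line);
        }
    }
}
```

If every thread with that prefix shows up as waiting on a CompletableFuture$Signaller while the pending task count keeps growing, that would match the situation described above.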
      

          People

            Benjamin Roth (brstgt)
            Tom Hobbs
            Votes: 0
            Watchers: 11
