Cassandra / CASSANDRA-19761

When JVM dtest is shutting down, if a new epoch is being committed the node is unable to shut down


Details

    Description

      The following thread dump was seen on the accord branch, but the problem also exists in trunk.

      node1_isolatedExecutor:8:
      	java.base@11.0.15/jdk.internal.misc.Unsafe.park(Native Method)
      	java.base@11.0.15/java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:234)
      	org.apache.cassandra.simulator.systems.InterceptorOfSystemMethods$None.parkNanos(InterceptorOfSystemMethods.java:373)
      	org.apache.cassandra.simulator.systems.InterceptorOfSystemMethods$Global.parkNanos(InterceptorOfSystemMethods.java:166)
      	java.base@11.0.15/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2123)
      	java.base@11.0.15/java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1454)
      	org.apache.cassandra.utils.ExecutorUtils.awaitTerminationUntil(ExecutorUtils.java:110)
      	org.apache.cassandra.utils.ExecutorUtils.awaitTermination(ExecutorUtils.java:100)
      	org.apache.cassandra.concurrent.Stage.shutdownAndWait(Stage.java:195)
      	org.apache.cassandra.distributed.impl.Instance.lambda$shutdown$44(Instance.java:975)
      
      node1_MiscStage:1:
      	java.base@11.0.15/jdk.internal.misc.Unsafe.park(Native Method)
      	java.base@11.0.15/java.util.concurrent.locks.LockSupport.park(LockSupport.java:323)
      	org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:290)
      	org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:283)
      	org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
      	org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:338)
      	org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
      	org.apache.cassandra.utils.concurrent.Awaitable$AbstractAwaitable.awaitUninterruptibly(Awaitable.java:259)
      	org.apache.cassandra.tcm.log.LocalLog$Async.runOnce(LocalLog.java:710)
      	org.apache.cassandra.tcm.log.LocalLog.runOnce(LocalLog.java:404)
      	org.apache.cassandra.tcm.log.LocalLog.waitForHighestConsecutive(LocalLog.java:346)
      	org.apache.cassandra.tcm.PaxosBackedProcessor.fetchLogAndWait(PaxosBackedProcessor.java:163)
      	org.apache.cassandra.tcm.AbstractLocalProcessor.commit(AbstractLocalProcessor.java:109)
      	org.apache.cassandra.distributed.test.log.TestProcessor.commit(TestProcessor.java:61)
      	org.apache.cassandra.tcm.ClusterMetadataService$SwitchableProcessor.commit(ClusterMetadataService.java:841)
      	org.apache.cassandra.tcm.Processor.commit(Processor.java:45)
      	org.apache.cassandra.tcm.ClusterMetadataService.commit(ClusterMetadataService.java:516)
      	org.apache.cassandra.service.accord.AccordFastPathCoordinator$Impl.lambda$updateFastPath$2(AccordFastPathCoordinator.java:208)
      	org.apache.cassandra.service.accord.AccordFastPathCoordinator$Impl$$Lambda$11211/0x0000000802441840.run(Unknown Source)
      

      Accord is trying to commit a new epoch, but TCM uses “awaitUninterruptibly”, which ignores the thread interrupt issued while the cluster is shutting down. The MiscStage thread therefore never returns from the commit, so the shutdown thread loops endlessly waiting for the stage to terminate and the test fails to close.
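
The interaction described above can be reproduced outside Cassandra with plain java.util.concurrent. The following is a minimal sketch, not Cassandra code: the class name, the helper methods, and the 500 ms timeout are all hypothetical and chosen for illustration. It submits a task that waits on a latch either interruptibly or while swallowing interrupts, calls shutdownNow(), and reports whether the executor terminated within the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; none of these names exist in Cassandra.
public class UninterruptibleShutdownSketch {

    /** Waits for the latch and swallows interrupts until it opens (the problematic pattern). */
    static void awaitIgnoringInterrupts(CountDownLatch latch) {
        boolean interrupted = false;
        while (true) {
            try {
                latch.await();
                break;
            } catch (InterruptedException e) {
                interrupted = true; // remember the interrupt, but keep waiting
            }
        }
        if (interrupted) {
            Thread.currentThread().interrupt(); // restore the flag once the wait is over
        }
    }

    /** Returns true if the executor terminated within the timeout after shutdownNow(). */
    static boolean shutsDownPromptly(boolean uninterruptible) throws InterruptedException {
        CountDownLatch pending = new CountDownLatch(1); // stands in for the epoch that never arrives
        CountDownLatch started = new CountDownLatch(1);
        ExecutorService stage = Executors.newSingleThreadExecutor();

        stage.execute(() -> {
            started.countDown();
            if (uninterruptible) {
                awaitIgnoringInterrupts(pending);
            } else {
                try {
                    pending.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // give up the wait on interrupt
                }
            }
        });

        started.await();
        stage.shutdownNow(); // interrupts the worker thread
        boolean terminated = stage.awaitTermination(500, TimeUnit.MILLISECONDS);

        // Release the worker so the demo itself can exit cleanly.
        pending.countDown();
        stage.awaitTermination(5, TimeUnit.SECONDS);
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("interruptible wait terminated:   " + shutsDownPromptly(false));
        System.out.println("uninterruptible wait terminated: " + shutsDownPromptly(true));
    }
}
```

In this sketch the interruptible variant is expected to terminate and the uninterruptible one is not, which mirrors the thread dump: the shutdown thread sits in awaitTermination while the stage thread stays parked. A shutdown path that retries awaitTermination until it succeeds would then never return.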

      Attachments

        1. ci_summary.html
          39 kB
          Sam Tunnicliffe
        2. ci_summary-1.html
          41 kB
          Sam Tunnicliffe

            People

              samt Sam Tunnicliffe
              dcapwell David Capwell
              Sam Tunnicliffe
              Alex Petrov