CASSANDRA-14804

Running repair on multiple nodes in parallel could halt entire repair


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Normal
    • Resolution: Fixed
    • Fix Version/s: 3.0.18
    • Component/s: Consistency/Repair
    • Labels: None
    • Severity: Normal

    Description

      There is a possible deadlock if we run repair on multiple nodes at the same time. We have come across a situation in production in which repairing multiple nodes simultaneously makes repair hang forever. Here are the details:

      Time t1
      node-1 has issued a repair command involving node-2, but for some reason node-2 did not receive the prepare request, so node-1 waits in prepareForRepair for one hour while holding the lock.
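
      The pattern described at t1 is, in essence, a timed wait on a CountDownLatch performed while the ActiveRepairService monitor is held. The following is only a minimal sketch of that pattern, not the actual Cassandra source; apart from the method name and the one-hour timeout taken from the description above, all names are illustrative.

      // Minimal sketch (not Cassandra code): the latch wait happens while the
      // monitor on 'this' is held, so every other synchronized method on the
      // same object is shut out until the wait times out or the latch opens.
      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      class PrepareSketch
      {
          // counted down once every neighbour acknowledges the prepare message
          private final CountDownLatch prepareLatch = new CountDownLatch(1);

          synchronized void prepareForRepair() throws InterruptedException
          {
              // if a neighbour never answers (node-2 at t1), this parks for up
              // to one hour with the monitor still locked
              prepareLatch.await(1, TimeUnit.HOURS);
          }
      }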

      Time t2
      node-2 sends a prepare-repair request to node-1; some exception occurs on node-1 and, while trying to clean up the parent repair session, node-1 cannot get the lock because the one hour above has not yet elapsed.

      Snippet of jstack output from node-1:

      "Thread-888" #262588 daemon prio=5 os_prio=0 waiting on condition
      java.lang.Thread.State: TIMED_WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:332)
      • locked <> (a org.apache.cassandra.service.ActiveRepairService)
        at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:214)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)

      "AntiEntropyStage:1" #1789 daemon prio=5 os_prio=0 waiting for monitor entry []
      java.lang.Thread.State: BLOCKED (on object monitor)
      at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)

      • waiting to lock <> (a org.apache.cassandra.service.ActiveRepairService)
        at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:172)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
        at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown Source)
        at java.lang.Thread.run(Thread.java:748)
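
      The two stacks above amount to one thread parking on a CountDownLatch inside a synchronized region while a second thread blocks trying to enter another synchronized method on the same ActiveRepairService object. The self-contained sketch below reproduces that state in miniature; the thread and method names are borrowed from the traces for readability, but the code is illustrative only and not the actual Cassandra implementation.

      import java.util.concurrent.CountDownLatch;
      import java.util.concurrent.TimeUnit;

      public class RepairStallDemo
      {
          private final CountDownLatch prepareLatch = new CountDownLatch(1);

          // models "Thread-888": parks on the latch for up to an hour while
          // holding the monitor on this object
          synchronized void prepareForRepair() throws InterruptedException
          {
              prepareLatch.await(1, TimeUnit.HOURS);
          }

          // models "AntiEntropyStage:1": needs the same monitor to clean up the
          // parent repair session, so it stays BLOCKED until the wait above ends
          synchronized void removeParentRepairSession()
          {
              System.out.println("parent repair session removed");
          }

          public static void main(String[] args) throws Exception
          {
              RepairStallDemo svc = new RepairStallDemo();

              Thread repair = new Thread(() -> {
                  try { svc.prepareForRepair(); } catch (InterruptedException ignored) { }
              }, "Thread-888");
              Thread antiEntropy = new Thread(svc::removeParentRepairSession, "AntiEntropyStage:1");

              repair.start();
              Thread.sleep(500);      // let the repair thread take the monitor first
              antiEntropy.start();
              Thread.sleep(500);

              // prints TIMED_WAITING and BLOCKED, mirroring the jstack output above
              System.out.println(repair.getName() + ": " + repair.getState());
              System.out.println(antiEntropy.getName() + ": " + antiEntropy.getState());

              svc.prepareLatch.countDown();   // open the latch so the demo can finish
          }
      }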

      Time t3
      node-2 (and possibly other nodes, node-3, …) sends a prepare request to node-1, but node-1's AntiEntropyStage thread is stuck waiting for the lock at ActiveRepairService.removeParentRepairSession, so node-2, node-3 (and possibly other nodes) also go into a one-hour wait while holding their own locks. This rolling effect continues and stalls repair across the entire ring.

      If we completely stop triggering repairs, the system recovers slowly, but there are two major problems with this:
      1. Externally there is no way to decide whether to trigger a new repair or to wait for the system to recover (see the sketch after this list).
      2. The system does recover eventually, but it takes roughly n hours, where n = the number of repair requests fired; the only ways out of this situation are either a rolling restart of the entire ring or waiting n hours before triggering a new repair request.
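
      As a side note on point 1: the stalled state is at least visible from inside the JVM via the standard java.lang.management API. The sketch below only illustrates what such a check could look for (it would have to run inside the Cassandra process, e.g. attached as an agent, or be adapted to remote JMX); it is not an existing Cassandra or nodetool facility.

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadInfo;
      import java.lang.management.ThreadMXBean;

      public class RepairStallCheck
      {
          public static void main(String[] args)
          {
              ThreadMXBean threads = ManagementFactory.getThreadMXBean();
              // dump all threads, including the monitors and synchronizers they hold
              for (ThreadInfo info : threads.dumpAllThreads(true, true))
              {
                  boolean blockedOnRepairService =
                      info.getThreadState() == Thread.State.BLOCKED
                      && info.getLockName() != null
                      && info.getLockName().contains("ActiveRepairService");

                  // an AntiEntropyStage thread stuck like this matches the stall above
                  if (info.getThreadName().startsWith("AntiEntropyStage") && blockedOnRepairService)
                      System.out.println(info.getThreadName() + " is blocked on " + info.getLockName());
              }
          }
      }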

      Please let me know if my above analysis makes sense or not.

          People

            Assignee: Unassigned
            Reporter: Jaydeepkumar Chovatia (chovatia.jaydeep@gmail.com)
            Votes: 0
            Watchers: 3
