Details

Type: Bug
Status: Resolved
Resolution: Fixed
Priority: Normal
Description
Possible deadlock if we run repair on multiple nodes at the same time. We have come across a situation in production where repairing multiple nodes concurrently makes repair hang forever. Here are the details:
Time t1
node-1 issued a repair command to node-2, but for some reason node-2 never received the request, so node-1 is waiting in prepareForRepair for 1 hour while holding the ActiveRepairService lock.
Time t2
node-2 sent a prepare-repair request to node-1; an exception occurred on node-1, which is now trying to clean up the parent session, but it cannot acquire the lock because the 1-hour wait above has not yet elapsed.
Snippet of jstack output on node-1:
"Thread-888" #262588 daemon prio=5 os_prio=0 waiting on condition
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for (a java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.cassandra.service.ActiveRepairService.prepareForRepair(ActiveRepairService.java:332)
- locked <> (a org.apache.cassandra.service.ActiveRepairService)
at org.apache.cassandra.repair.RepairRunnable.runMayThrow(RepairRunnable.java:214)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)

"AntiEntropyStage:1" #1789 daemon prio=5 os_prio=0 waiting for monitor entry []
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:421)
- waiting to lock <> (a org.apache.cassandra.service.ActiveRepairService)
at org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:172)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
at org.apache.cassandra.concurrent.NamedThreadFactory$$Lambda$9/864248990.run(Unknown Source)
at java.lang.Thread.run(Thread.java:748)
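The two traces reduce to the pattern below. This is a simplified Java sketch, not the actual Cassandra implementation; only the method names and the ActiveRepairService monitor come from the traces, everything else is illustrative:

import java.util.UUID;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class ActiveRepairServiceSketch
{
    // Coordinator path ("Thread-888"): holds the service monitor for the whole
    // await, up to 1 hour, if a peer never acknowledges the prepare message.
    public synchronized void prepareForRepair(CountDownLatch prepareLatch) throws InterruptedException
    {
        prepareLatch.await(1, TimeUnit.HOURS);
    }

    // Message-handling path ("AntiEntropyStage:1"): needs the same monitor to
    // clean up a failed parent session, so it blocks behind prepareForRepair
    // and can no longer process any other incoming repair messages.
    public synchronized void removeParentRepairSession(UUID parentSessionId)
    {
        // session cleanup elided
    }
}

Because AntiEntropyStage handles all incoming repair messages on node-1, blocking it here is what lets the stall spread to the other nodes: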
Time t3
node-2 (and possibly other nodes: node-3, ...) sent prepare requests to node-1, but node-1's AntiEntropyStage thread is blocked waiting for the lock at ActiveRepairService.removeParentRepairSession, so node-2, node-3 (and possibly other nodes) also go into a 1-hour wait while holding their own locks. This rolling effect continues and stalls repair across the entire ring.
If we stop triggering repairs entirely, the system recovers slowly, but there are two major problems with this:
1. Externally there is no way to tell whether to trigger a new repair or to wait for the system to recover.
2. The system does recover eventually, but it takes roughly n hours, where n = the number of repair requests fired, because the blocked requests serialize behind the same lock and each holds it for the full 1-hour timeout (see the sketch below). The only ways out of this situation are a rolling restart of the entire ring or waiting those n hours before triggering a new repair.
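To see why recovery time scales linearly with the number of requests, here is a toy simulation (all names and numbers here are ours, with the 1-hour timeout scaled down to 100 ms): each holder of a single monitor parks for the full hold time before releasing it, so the waiters drain one at a time and total wall time grows with n.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SerializedMonitorHolds
{
    static final Object MONITOR = new Object();
    static final long HOLD_MILLIS = 100; // stand-in for the 1-hour prepare timeout

    public static void main(String[] args) throws InterruptedException
    {
        int n = 5; // stand-in for the number of repair requests fired
        ExecutorService pool = Executors.newFixedThreadPool(n);
        long start = System.nanoTime();
        for (int i = 0; i < n; i++)
        {
            pool.submit(() -> {
                synchronized (MONITOR) // each task holds the monitor in turn
                {
                    try { Thread.sleep(HOLD_MILLIS); }
                    catch (InterruptedException e) { Thread.currentThread().interrupt(); }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Prints roughly n * HOLD_MILLIS, i.e. ~n hours at the real timeout.
        System.out.printf("total ~%d ms for %d serialized holds%n",
                          (System.nanoTime() - start) / 1_000_000, n);
    }
}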
Please let me know whether the above analysis makes sense.