Details
-
Bug
-
Status: Resolved
-
Normal
-
Resolution: Fixed
-
4.1.4, 5.0-alpha1, 5.0, 5.1
-
None
-
Degradation - Resource Management
-
Critical
-
Normal
-
User Report
-
All
-
None
-
Description
I've observed in a recent stack trace from a node running 4.1 what looks like a deadlock around the StreamSession monitor lock when StreamSession#receive() waits via syncUninteruptibly() for a response to a control message.
"Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor entry [0x00007f056d277000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524) - waiting to lock <0x00000006816fae70> (a org.apache.cassandra.streaming.StreamSession) at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690) at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264) at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233) at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown Source) at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134) at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148) at org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190) at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124) at org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown Source) at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195) at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124) at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96) at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104) at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148) at org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139) at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009) at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870) at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071) at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
It seems that while receive()} is holding the monitor lock on {{StreamSession, the callback that executes on a different thread for the control message it sends carries an error. This error, when handled in onError(), then calls closeSession(), which tries to acquire the monitor lock already held in receive().
"Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 waiting on condition [0x00007f0830947000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method) at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:323) at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289) at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282) at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306) at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154) at org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244) at org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127) at org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34) at org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186) at org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482) at org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254) at org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133) at org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34) at org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94) at org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186) at org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121) at org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34) at org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054) at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628) - locked <0x00000006816fae70> (a org.apache.cassandra.streaming.StreamSession) at org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
The most straightforward way to fix this might be to put an upper bound on the wait in receive(), along w/ some logging at WARN if it times out. This would at least allow us to make progress eventually and close the session properly. The syncUninterruptibly() was intended to avoid a race during shutdown, IIRC, and unless the timeout is comically low, it shouldn't compromise that.
The problem can usually be fixed by bouncing affected nodes, who will usually present w/ an increasing backlog of unrepaired data and a log message that looks something like this, in addition to a number of streaming errors:
INFO 2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>] org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 - [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to begin streaming <data directory>-nb-101920-big-Data.db. This message logs every 3 minutes
Attachments
Issue Links
- is related to
-
CASSANDRA-18815 Fix dtests: replace_address_test.TestReplaceAddress.test_restart_failed_replace and others
- Resolved
- relates to
-
CASSANDRA-17116 When streaming sees a ClosedChannelException this triggers the disk failure policy
- Resolved