[CASSANDRA-18733] Waiting indefinitely on ReceivedMessage response in StreamSession#receive() can cause deadlock - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.1.4, 5.0-alpha1, 5.0, 5.1
Component/s: Consistency/Repair, Consistency/Streaming
Labels:
None

Bug Category:
Degradation - Resource Management
Severity:
Critical
Complexity:
Normal
Discovered By:
User Report
Platform:

All
Impacts:

None
Since Version:

4.1.0
Source Control Link:

https://github.com/apache/cassandra/commit/bde4fa0013eb8cec5b1d88b21ca4463bc07272bb
Test and Documentation Plan:

Hide

Added StreamDisconnectedWhileReceivingTest

Show
Added StreamDisconnectedWhileReceivingTest

Description

I've observed in a recent stack trace from a node running 4.1 what looks like a deadlock around the StreamSession monitor lock when StreamSession#receive() waits via syncUninteruptibly() for a response to a control message.

"Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor entry  [0x00007f056d277000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524)
        - waiting to lock <0x00000006816fae70> (a org.apache.cassandra.streaming.StreamSession)
        at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690)
        at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264)
        at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233)
        at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown Source)
        at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134)
        at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148)
        at org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190)
        at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
        at org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown Source)
        at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
        at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
        at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
        at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
        at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
        at org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
        at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

It seems that while receive()} is holding the monitor lock on {{StreamSession, the callback that executes on a different thread for the control message it sends carries an error. This error, when handled in onError(), then calls closeSession(), which tries to acquire the monitor lock already held in receive().

"Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 waiting on condition  [0x00007f0830947000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
        at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:323)
        at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
        at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
        at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
        at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
        at org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244)
        at org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127)
        at org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34)
        at org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
        at org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482)
        at org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254)
        at org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133)
        at org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34)
        at org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94)
        at org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186)
        at org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121)
        at org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34)
        at org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054)
        at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628)
        - locked <0x00000006816fae70> (a org.apache.cassandra.streaming.StreamSession)
        at org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)

The most straightforward way to fix this might be to put an upper bound on the wait in receive(), along w/ some logging at WARN if it times out. This would at least allow us to make progress eventually and close the session properly. The syncUninterruptibly() was intended to avoid a race during shutdown, IIRC, and unless the timeout is comically low, it shouldn't compromise that.

The problem can usually be fixed by bouncing affected nodes, who will usually present w/ an increasing backlog of unrepaired data and a log message that looks something like this, in addition to a number of streaming errors:

INFO  2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>] org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 - [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to begin streaming <data directory>-nb-101920-big-Data.db. This message logs every 3 minutes

Attachments

Issue Links

is related to

CASSANDRA-18815 Fix dtests: replace_address_test.TestReplaceAddress.test_restart_failed_replace and others

Resolved

relates to

CASSANDRA-17116 When streaming sees a ClosedChannelException this triggers the disk failure policy

Resolved

Waiting indefinitely on ReceivedMessage response in StreamSession#receive() can cause deadlock

Details

Description

Attachments

Issue Links

Activity

People

Dates