Uploaded image for project: 'Apache Cassandra'
  1. Apache Cassandra
  2. CASSANDRA-18733

Waiting indefinitely on ReceivedMessage response in StreamSession#receive() can cause deadlock

    XMLWordPrintableJSON

Details

    Description

      I've observed in a recent stack trace from a node running 4.1 what looks like a deadlock around the StreamSession monitor lock when StreamSession#receive() waits via syncUninteruptibly() for a response to a control message.

      "Messaging-EventLoop-3-10" #320 daemon prio=5 os_prio=0 cpu=57979617.98ms elapsed=5587916.03s tid=0x00007f056e88ae00 nid=0x80ec waiting for monitor entry  [0x00007f056d277000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:524)
              - waiting to lock <0x00000006816fae70> (a org.apache.cassandra.streaming.StreamSession)
              at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:690)
              at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.onMessageComplete(StreamingMultiplexedChannel.java:264)
              at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel.lambda$sendMessage$1(StreamingMultiplexedChannel.java:233)
              at org.apache.cassandra.streaming.async.StreamingMultiplexedChannel$$Lambda$2029/0x00000008007a0c40.operationComplete(Unknown Source)
              at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:134)
              at org.apache.cassandra.utils.concurrent.ListenerList.notifyListener(ListenerList.java:148)
              at org.apache.cassandra.utils.concurrent.ListenerList$GenericFutureListenerList.notifySelf(ListenerList.java:190)
              at org.apache.cassandra.utils.concurrent.ListenerList.lambda$notifyExclusive$0(ListenerList.java:124)
              at org.apache.cassandra.utils.concurrent.ListenerList$$Lambda$950/0x0000000800666040.accept(Unknown Source)
              at org.apache.cassandra.utils.concurrent.IntrusiveStack.forEach(IntrusiveStack.java:195)
              at org.apache.cassandra.utils.concurrent.ListenerList.notifyExclusive(ListenerList.java:124)
              at org.apache.cassandra.utils.concurrent.ListenerList.notify(ListenerList.java:96)
              at org.apache.cassandra.utils.concurrent.AsyncFuture.trySet(AsyncFuture.java:104)
              at org.apache.cassandra.utils.concurrent.AbstractFuture.tryFailure(AbstractFuture.java:148)
              at org.apache.cassandra.utils.concurrent.AsyncPromise.tryFailure(AsyncPromise.java:139)
              at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1009)
              at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:870)
              at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
              at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
              at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
              at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
              at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
              at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
              at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
              at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
              at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
              at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
              at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
      

      It seems that while receive()} is holding the monitor lock on {{StreamSession, the callback that executes on a different thread for the control message it sends carries an error. This error, when handled in onError(), then calls closeSession(), which tries to acquire the monitor lock already held in receive().

      "Stream-Deserializer-/100.70.229.6:7000-de724029" #1919000 daemon prio=5 os_prio=0 cpu=224.66ms elapsed=1604976.92s tid=0x00007f0561c66500 nid=0xe2a2 waiting on condition  [0x00007f0830947000]
         java.lang.Thread.State: WAITING (parking)
              at jdk.internal.misc.Unsafe.park(java.base@11.0.16/Native Method)
              at java.util.concurrent.locks.LockSupport.park(java.base@11.0.16/LockSupport.java:323)
              at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:289)
              at org.apache.cassandra.utils.concurrent.WaitQueue$Standard$AbstractSignal.await(WaitQueue.java:282)
              at org.apache.cassandra.utils.concurrent.Awaitable$AsyncAwaitable.await(Awaitable.java:306)
              at org.apache.cassandra.utils.concurrent.AsyncFuture.await(AsyncFuture.java:154)
              at org.apache.cassandra.utils.concurrent.AsyncPromise.await(AsyncPromise.java:244)
              at org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:127)
              at org.apache.cassandra.net.AsyncChannelPromise.await(AsyncChannelPromise.java:34)
              at org.apache.cassandra.utils.concurrent.Awaitable$Defaults.awaitUninterruptibly(Awaitable.java:186)
              at org.apache.cassandra.utils.concurrent.AbstractFuture.awaitUninterruptibly(AbstractFuture.java:482)
              at org.apache.cassandra.utils.concurrent.AsyncPromise.awaitUninterruptibly(AsyncPromise.java:254)
              at org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:133)
              at org.apache.cassandra.net.AsyncChannelPromise.awaitUninterruptibly(AsyncChannelPromise.java:34)
              at org.apache.cassandra.utils.concurrent.Future.syncUninterruptibly(Future.java:94)
              at org.apache.cassandra.utils.concurrent.AsyncPromise.syncUninterruptibly(AsyncPromise.java:186)
              at org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:121)
              at org.apache.cassandra.net.AsyncChannelPromise.syncUninterruptibly(AsyncChannelPromise.java:34)
              at org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:1054)
              at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:628)
              - locked <0x00000006816fae70> (a org.apache.cassandra.streaming.StreamSession)
              at org.apache.cassandra.streaming.StreamDeserializingTask.run(StreamDeserializingTask.java:76)
              at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
              at java.lang.Thread.run(java.base@11.0.16/Thread.java:829)
      

      The most straightforward way to fix this might be to put an upper bound on the wait in receive(), along w/ some logging at WARN if it times out. This would at least allow us to make progress eventually and close the session properly. The syncUninterruptibly() was intended to avoid a race during shutdown, IIRC, and unless the timeout is comically low, it shouldn't compromise that.

      The problem can usually be fixed by bouncing affected nodes, who will usually present w/ an increasing backlog of unrepaired data and a log message that looks something like this, in addition to a number of streaming errors:

      INFO  2023-08-04T10:45:54,845 [NettyStreaming-Outbound-/<ip:port>] org.apache.cassandra.streaming.async.StreamingMultiplexedChannel:359 - [Stream #10465c10-3108-11ee-af8a-3f74fd21ad9d] waiting to acquire a permit to begin streaming <data directory>-nb-101920-big-Data.db. This message logs every 3 minutes
      

      Attachments

        Issue Links

          Activity

            People

              jmeredithco Jon Meredith
              maedhroz Caleb Rackliffe
              Jon Meredith
              Caleb Rackliffe
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: