Ratis / RATIS-2116

Follower state synchronization is blocked


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0, 2.5.1, 3.0.1
    • Fix Version/s: 3.1.1, 3.2.0
    • Component/s: server
    • Labels: None

    Description

      On version 2.5.1, we found that in some cases a follower's state synchronization becomes permanently blocked.

      Scenario: When the task queue of the SegmentedRaftLogWorker has the pattern (WriteLog, WriteLog, ..., PurgeLog), the last WriteLog issued by RaftServerImpl.appendEntries does not flush data and complete its result future immediately, because a PurgeLog task is still pending in the queue. Instead, the result future is enqueued to be completed after a later WriteLog flushes data. However, the "nioEventLoopGroup-3-1" thread is already blocked in appendEntries waiting on that future, so it never adds a new WriteLog to the SegmentedRaftLogWorker task queue. The result is a deadlock, and state synchronization stops.
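
      The stall described in this scenario can be illustrated with a toy model. This is a minimal sketch and not Ratis code: the class PurgeStallSketch, its Task type, and the rule "a purge never triggers a flush" are assumptions made for illustration. A single worker drains a queue, and a write's future is completed only if the queue is empty right after that write is processed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of the stall; NOT Ratis code. All names here are hypothetical. */
final class PurgeStallSketch {
  /** A queued task: a write carries a result future, a purge carries none. */
  static final class Task {
    final CompletableFuture<Void> writeFuture; // null for a purge task
    Task(CompletableFuture<Void> writeFuture) { this.writeFuture = writeFuture; }
    boolean isWrite() { return writeFuture != null; }
  }

  /**
   * Queues one write (optionally followed by a purge), runs a single worker,
   * and reports whether the write's future completes within the timeout.
   */
  static boolean lastWriteCompletes(boolean purgeQueuedBehindWrite, long timeoutMs) {
    BlockingQueue<Task> queue = new LinkedBlockingQueue<>();
    CompletableFuture<Void> lastWrite = new CompletableFuture<>();
    queue.add(new Task(lastWrite));
    if (purgeQueuedBehindWrite) {
      queue.add(new Task(null));
    }

    Thread worker = new Thread(() -> {
      List<CompletableFuture<Void>> awaitingFlush = new ArrayList<>();
      try {
        while (true) {
          Task task = queue.take();
          if (!task.isWrite()) {
            continue; // purge: assumed here to never trigger a flush
          }
          awaitingFlush.add(task.writeFuture);
          if (queue.isEmpty()) {
            // Nothing pending behind this write: "flush" and complete futures.
            awaitingFlush.forEach(f -> f.complete(null));
            awaitingFlush.clear();
          }
          // Otherwise completion is deferred to a later write's flush.
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    worker.setDaemon(true);
    worker.start();

    try {
      // The jstack in this report shows join() with no timeout; a timeout is
      // used here only so that the demo terminates.
      lastWrite.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException(e);
    } finally {
      worker.interrupt();
    }
  }

  public static void main(String[] args) {
    System.out.println("write only:        " + lastWriteCompletes(false, 500));
    System.out.println("write, then purge: " + lastWriteCompletes(true, 500));
  }
}
```

      In this model the write-only queue completes its future, while the (write, purge) queue leaves the future pending because the only party that could enqueue another write is the caller that is waiting on it.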

      I confirmed this by adding debug logs; detailed information is attached below. The issue can be easily reproduced by increasing the frequency of TakeSnapshot and PurgeLog operations. After checking the code in the master branch, I found that the issue still exists there.
       
      jstack:

      "nioEventLoopGroup-3-1" #58 prio=10 os_prio=0 tid=0x00007fc58400b800 nid=0x5493a waiting on condition [0x00007fc5b4f28000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park0(Native Method)
          - parking to wait for <0x00007fd86a4685e8> (a java.util.concurrent.CompletableFuture$Signaller)
          at sun.misc.Unsafe.park(Unsafe.java:1025)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:176)
          at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
          at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
          at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
          at java.util.concurrent.CompletableFuture.join(CompletableFuture.java:1934)
          at org.apache.ratis.server.impl.RaftServerImpl.appendEntries(RaftServerImpl.java:1379)
          at org.apache.ratis.server.impl.RaftServerProxy.appendEntries(RaftServerProxy.java:649)
          at org.apache.ratis.netty.server.NettyRpcService.handle(NettyRpcService.java:231)
          at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:95)
          at org.apache.ratis.netty.server.NettyRpcService$InboundHandler.channelRead0(NettyRpcService.java:91)
          at org.apache.ratis.thirdparty.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at org.apache.ratis.thirdparty.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
          at org.apache.ratis.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
          at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
          at org.apache.ratis.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
          at org.apache.ratis.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
          at org.apache.ratis.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
          at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
          at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
          at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
          at org.apache.ratis.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
          at org.apache.ratis.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
          at org.apache.ratis.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
          at org.apache.ratis.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:882)

          People

            Assignee: Haibo Sun (sunhaibotb)
            Reporter: Haibo Sun (sunhaibotb)