Description
Steps to reproduce
- Setup a 3 node ratis group.
- Write some transactions into the quorum.
- Bring 1 peer down.
- Write more transactions into the quorum (other 2 nodes), take a snapshot at the last txn and purge logs from the remaining 2 nodes.
- Start the node that was brought down.
Leader falls into a possible irrecoverable state with respect to appending log entries to the follower.
First time the follower comes back
2020-12-15 15:42:01,069 [grpc-default-executor-5] WARN server.GrpcLogAppender (GrpcLogAppender.java:resetClient(110)) - omNode-3@group-523986131536->omNode-2-GrpcLogAppender: Leader has not got in touch with Follower omNode-3@group-523986131536->omNode-2(c-1,m0,n409, attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=8398) yet, just keep nextIndex unchanged and retry. 2020-12-15 15:42:01,570 [grpc-default-executor-5] INFO server.RaftServer$Division (ServerState.java:setLeader(260)) - omNode-2@group-523986131536: change Leader from null to omNode-3 at term 3 for appendEntries, leader elected after 954ms 2020-12-15 15:42:01,570 [grpc-default-executor-5] INFO server.RaftServer$Division (RaftServerImpl.java:checkInconsistentAppendEntries(1249)) - omNode-2@group-523986131536: Failed appendEntries as previous log entry ((t:3, i:408)) is not found 2020-12-15 15:42:01,571 [grpc-default-executor-5] INFO server.RaftServer$Division (RaftServerImpl.java:checkInconsistentAppendEntries(1215)) - omNode-2@group-523986131536: inconsistency entries. Reply:omNode-3<-omNode-2#11187658:FAIL,INCONSISTENCY,nextIndex:205,term:2,followerCommit:203 2020-12-15 15:42:01,572 [grpc-default-executor-5] INFO leader.FollowerInfo (FollowerInfoImpl.java:lambda$new$0(46)) - omNode-3@group-523986131536->omNode-2: nextIndex: updateUnconditionally 409 -> 205 2020-12-15 15:42:01,572 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] ERROR leader.LogAppenderDaemon (LogAppenderDaemon.java:run(86)) - omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon failed org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 205 at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:283) at org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:143) at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:210) at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:144) at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:77) at java.lang.Thread.run(Thread.java:748) 2020-12-15 15:42:01,572 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServer$Division (LeaderStateImpl.java:restart(497)) - omNode-3@group-523986131536-LeaderStateImpl: Restarting GrpcLogAppender for omNode-3@group-523986131536->omNode-2
After the above, the leader falls into this error loop.
2020-12-15 15:42:01,574 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] ERROR leader.LogAppenderDaemon (LogAppenderDaemon.java:run(86)) - omNode-3@group-523986131536->omNode-2-GrpcLogAp pender-LogAppenderDaemon failed org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 0 at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:283) at org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:143) at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:210) at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:144) at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:77) at java.lang.Thread.run(Thread.java:748) 2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServer$Division (LeaderStateImpl.java:restart(497)) - omNode-3@group-523986131536-LeaderStateImp l: Restarting GrpcLogAppender for omNode-3@group-523986131536->omNode-2 2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO metrics.RatisMetrics (RatisMetrics.java:unregister(43)) - Unregistering Metrics Registry : ratis_grpc.log_a ppender.omNode-3@group-523986131536 2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.snapshot.chunk.size.max = 16MB (=16777216) (default) 2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.buffer.byte-limit = 3355 4432 (custom) 2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.buffer.element-limit = 1 024 (custom) 2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO grpc.GrpcConfigKeys (ConfUtils.java:logGet(44)) - raft.grpc.server.leader.outstanding.appends.max = 128 (de fault) 2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.rpc.request.timeout = 3000ms (default ) 2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.install.snapshot.enabled = false (custom) 2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO metrics.RatisMetrics (RatisMetrics.java:lambda$create$0(36)) - Creating Metrics Registry : ratis_grpc.log_a ppender.omNode-3@group-523986131536 2020-12-15 15:42:01,577 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] ERROR leader.LogAppenderDaemon (LogAppenderDaemon.java:run(86)) - omNode-3@group-523986131536->omNode-2-GrpcLogAp pender-LogAppenderDaemon failed org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 0 at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:283) at org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:143) at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:210) at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:144) at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:77) at java.lang.Thread.run(Thread.java:748) 2020-12-15 15:42:01,578 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO server.RaftServer$Division (LeaderStateImpl.java:restart(497)) - omNode-3@group-523986131536-LeaderStateImpl: Restarting GrpcLogAppender for omNode-3@group-523986131536->omNode-2