Uploaded image for project: 'Ratis'
  1. Ratis
  2. RATIS-1241

Leader unable to append logs to a recovering follower when its logs have been purged.

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • None
    • 1.1.0
    • server
    • None

    Description

      Steps to reproduce

      • Setup a 3 node ratis group.
      • Write some transactions into the quorum.
      • Bring 1 peer down.
      • Write more transactions into the quorum (other 2 nodes), take a snapshot at the last txn and purge logs from the remaining 2 nodes.
      • Start the node that was brought down.

      Leader falls into a possible irrecoverable state with respect to appending log entries to the follower.

      First time the follower comes back

      2020-12-15 15:42:01,069 [grpc-default-executor-5] WARN  server.GrpcLogAppender (GrpcLogAppender.java:resetClient(110)) - omNode-3@group-523986131536->omNode-2-GrpcLogAppender: Leader has not got in touch with Follower omNode-3@group-523986131536->omNode-2(c-1,m0,n409, attendVote=true, lastRpcSendTime=0, lastRpcResponseTime=8398) yet, just keep nextIndex unchanged and retry.
      2020-12-15 15:42:01,570 [grpc-default-executor-5] INFO  server.RaftServer$Division (ServerState.java:setLeader(260)) - omNode-2@group-523986131536: change Leader from null to omNode-3 at term 3 for appendEntries, leader elected after 954ms
      2020-12-15 15:42:01,570 [grpc-default-executor-5] INFO  server.RaftServer$Division (RaftServerImpl.java:checkInconsistentAppendEntries(1249)) - omNode-2@group-523986131536: Failed appendEntries as previous log entry ((t:3, i:408)) is not found
      2020-12-15 15:42:01,571 [grpc-default-executor-5] INFO  server.RaftServer$Division (RaftServerImpl.java:checkInconsistentAppendEntries(1215)) - omNode-2@group-523986131536: inconsistency entries. Reply:omNode-3<-omNode-2#11187658:FAIL,INCONSISTENCY,nextIndex:205,term:2,followerCommit:203
      2020-12-15 15:42:01,572 [grpc-default-executor-5] INFO  leader.FollowerInfo (FollowerInfoImpl.java:lambda$new$0(46)) - omNode-3@group-523986131536->omNode-2: nextIndex: updateUnconditionally 409 -> 205
      2020-12-15 15:42:01,572 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] ERROR leader.LogAppenderDaemon (LogAppenderDaemon.java:run(86)) - omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon failed
      org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 205
              at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:283)
              at org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:143)
              at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:210)
              at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:144)
              at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:77)
              at java.lang.Thread.run(Thread.java:748)
      2020-12-15 15:42:01,572 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServer$Division (LeaderStateImpl.java:restart(497)) - omNode-3@group-523986131536-LeaderStateImpl: Restarting GrpcLogAppender for omNode-3@group-523986131536->omNode-2
      

      After the above, the leader falls into this error loop.

      2020-12-15 15:42:01,574 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] ERROR leader.LogAppenderDaemon (LogAppenderDaemon.java:run(86)) - omNode-3@group-523986131536->omNode-2-GrpcLogAp
      pender-LogAppenderDaemon failed
      org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 0
              at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:283)
              at org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:143)
              at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:210)
              at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:144)
              at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:77)
              at java.lang.Thread.run(Thread.java:748)
      2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServer$Division (LeaderStateImpl.java:restart(497)) - omNode-3@group-523986131536-LeaderStateImp
      l: Restarting GrpcLogAppender for omNode-3@group-523986131536->omNode-2
      2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  metrics.RatisMetrics (RatisMetrics.java:unregister(43)) - Unregistering Metrics Registry : ratis_grpc.log_a
      ppender.omNode-3@group-523986131536
      2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.snapshot.chunk.size.max
      = 16MB (=16777216) (default)
      2020-12-15 15:42:01,575 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.buffer.byte-limit = 3355
      4432 (custom)
      2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.buffer.element-limit = 1
      024 (custom)
      2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  grpc.GrpcConfigKeys (ConfUtils.java:logGet(44)) - raft.grpc.server.leader.outstanding.appends.max = 128 (de
      fault)
      2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.rpc.request.timeout = 3000ms (default
      )
      2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServerConfigKeys (ConfUtils.java:logGet(44)) - raft.server.log.appender.install.snapshot.enabled
       = false (custom)
      2020-12-15 15:42:01,576 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  metrics.RatisMetrics (RatisMetrics.java:lambda$create$0(36)) - Creating Metrics Registry : ratis_grpc.log_a
      ppender.omNode-3@group-523986131536
      2020-12-15 15:42:01,577 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] ERROR leader.LogAppenderDaemon (LogAppenderDaemon.java:run(86)) - omNode-3@group-523986131536->omNode-2-GrpcLogAp
      pender-LogAppenderDaemon failed
      org.apache.ratis.server.raftlog.RaftLogIOException: Log entry not found: index = 0
              at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.getEntryWithData(SegmentedRaftLog.java:283)
              at org.apache.ratis.server.leader.LogAppenderBase.newAppendEntriesRequest(LogAppenderBase.java:143)
              at org.apache.ratis.grpc.server.GrpcLogAppender.appendLog(GrpcLogAppender.java:210)
              at org.apache.ratis.grpc.server.GrpcLogAppender.run(GrpcLogAppender.java:144)
              at org.apache.ratis.server.leader.LogAppenderDaemon.run(LogAppenderDaemon.java:77)
              at java.lang.Thread.run(Thread.java:748)
      2020-12-15 15:42:01,578 [omNode-3@group-523986131536->omNode-2-GrpcLogAppender-LogAppenderDaemon] INFO  server.RaftServer$Division (LeaderStateImpl.java:restart(497)) - omNode-3@group-523986131536-LeaderStateImpl: Restarting GrpcLogAppender for omNode-3@group-523986131536->omNode-2
      

      Attachments

        Issue Links

          Activity

            People

              avijayan Aravindan Vijayan
              avijayan Aravindan Vijayan
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h
                  2h