Uploaded image for project: 'Ratis'
  1. Ratis
  2. RATIS-2137

Leader fails to send correct index to follower after timeout exception

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.5.1
    • 3.1.1, 3.2.0
    • server
    • None

    Description

      I found that after the following log, the follower became unavailable. The follower received incorrect entries repeatedly for about 10min, then got installSnapshot failed and started to election. After two hours, it succeed to install snapshot, but failed to updateLastAppliedTermIndex. After that, it repeated 'receive installSnapshot and installSnapshot failed' for several hours until I restarted the server.

      24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: Failed appendEntries as the first entry (index 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
      24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: inconsistency entries. Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
      24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: Failed appendEntries as the first entry (index 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
      24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: inconsistency entries. Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
      ....(repeat 'Failed appendEntries')
      24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: Failed appendEntries as the first entry (index 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
      24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: inconsistency entries. Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
      24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
      24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
      java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 34795893, last included index in snapshot is 34670057
      24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
      java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 34795893, last included index in snapshot is 34670057
      24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
      24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
      24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to CANDIDATE at term 59 for changeToCandidate
      24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
      24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: start 1@group-47BEDE733167-LeaderElection5
      24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at term 59 for PRE_VOTE
      24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit vote requests at term 59 for 34233595: peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], old=null
      24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE REJECTED received 2 response(s) and 0 exception(s):
      24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: Response 0: 1<-2#0:FAIL-t59
      24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: Response 1: 1<-3#0:FAIL-t59
      24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: result REJECTED
      24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RaftServer$Division: 1@group-47BEDE733167: changes role from CANDIDATE to FOLLOWER at term 59 for REJECTED
      24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 1: shutdown 1@group-47BEDE733167-LeaderElection5
      24/08/11 09:15:50,111 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 1: start 1@group-47BEDE733167-FollowerState
      24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:802b8106-076b-4ffc-8cf1-1ace04952c7a,0
      24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: change Leader from null to 3 at term 59 for installSnapshot, leader elected after 1967ms
      24/08/11 09:15:52,075 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
      java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 34795893, last included index in snapshot is 34670057
      (repeat 'election, fail, install snapshot fail' for more than 2 hours)
      (still election fail, but this time install snapshot successfully, but failed to updateLastAppliedTermIndex)
      24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0
      24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: change Leader from null to 3 at term 59 for installSnapshot, leader elected after 3610ms
      24/08/11 11:52:39,434 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installing snapshot:3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0, to tmp dir:/data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
      24/08/11 11:52:39,455 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installed snapshot, renaming temporary dir /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7 to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm
      24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Reinitializing state machine.
      24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] RaftLog: 1@group-47BEDE733167-SegmentedRaftLog: snapshotIndex: updateIncreasingly 33870802 -> 34870063
      24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: obsolete snapshot provided: (t:59, i:34670809)
      24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: flushIndex: setUnconditionally 34795893 -> 34870063
      24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: safeCacheEvictIndex: setUnconditionally 34783743 -> 34870063
      24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater: appliedIndex: setUnconditionally 34795893 -> 34670809
      24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Closing segment log_inprogress_34783744 to index: 34795893
      24/08/11 11:52:40,837 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: successfully install the entire snapshot-34870063
      24/08/11 11:52:40,877 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: reply installSnapshot: 3<-1#0:OK-t59,SUCCESS,requestIndex=0
      24/08/11 11:52:40,902 INFO [1@group-47BEDE733167-SegmentedRaftLogWorker] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Rolled log segment from /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_inprogress_34783744 to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_34783744-34795893
      24/08/11 11:52:41,160 ERROR [1@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater caught a Throwable.
      java.lang.IllegalStateException: 1: Failed updateLastAppliedTermIndex: newTI = (t:59, i:34670810) < oldTI = (t:59, i:34795893)
      24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] RaftServer$Division: 1@group-47BEDE733167: shutdown
      24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] JmxRegister: Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-47BEDE733167,id=1
      24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
      24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-FollowerState] FollowerState: 1@group-47BEDE733167-FollowerState was interrupted
      24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] RaftServer$Division: 1@group-47BEDE733167: closes. applyIndex: 34670810
      24/08/11 11:52:42,340 INFO [1@group-47BEDE733167-StateMachineUpdater] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker close()
      24/08/11 11:52:48,170 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:7179265b-d007-4835-9b5f-fa9e4d853b72,0
      24/08/11 11:52:48,170 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
      org.apache.ratis.protocol.exceptions.ServerNotReadyException: 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
      24/08/11 11:52:58,171 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:7f56f35f-20f3-4b42-bdb6-c2d52db646fc,0
      24/08/11 11:52:58,171 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
      org.apache.ratis.protocol.exceptions.ServerNotReadyException: 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
      (repeat 'receive installSnapshot and installSnapshot failed' for serval hours until I restarted the server)

      Here is what I found in the leader's log

      24/08/11 09:03:10,130 WARN [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] LogAppender: 3@group-47BEDE733167->1-LogAppenderDefault: Failed to appendEntries (retry=1): org.apache.ratis.protocol.exceptions.TimeoutIOException
      24/08/11 09:03:13,714 INFO [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 34795876 -> 34795875
      24/08/11 09:03:13,715 INFO [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 34795875 -> 34795874

      I guess when the leader called appendEntries for the first time, the follower had been executed successfully, but the leader did not receive the follower's response and times out. When it resent, it found that it could not match the change and started to decreaseNextIndex.

      Sometimes it can be fixed automatically by rolling segment log, but not always.
      24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 2@group-47BEDE733167: Failed appendEntries as the first entry (index 35049128) already exists (snapshotIndex: 35070063, commitIndex: 35259248)
      24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 2@group-47BEDE733167: inconsistency entries. Reply:3<-2#3583083:FAIL-t59,INCONSISTENCY,nextIndex=35259249,followerCommit=35259248,matchIndex=-1
      24/08/12 11:47:55,132 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment log-35250411_35261308 to index:35261308
      24/08/12 11:54:46,402 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Taking a snapshot to file /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat.
      24/08/12 11:54:46,403 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Saving digest 272e2c87a8e62785588ab383c0b8c290 for snapshot file /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
      24/08/12 11:54:46,458 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Renaming a snapshot file /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
      24/08/12 11:54:46,487 INFO [2@group-47BEDE733167-StateMachineUpdater] SimpleStateMachineStorage: Deleting old snapshot at /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_34670063
      24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: Took a snapshot at index 35270066
      24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: snapshotIndex: updateIncreasingly 35070063 -> 35270066
      24/08/12 11:57:16,592 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment log-35261309_35271592 to index:35271592

      Attachments

        Issue Links

          Activity

            People

              lemony Kevin Liu
              lemony Kevin Liu
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 0.5h
                  0.5h