Description
I found that after the following log, the follower became unavailable. The follower received incorrect entries repeatedly for about 10min, then got installSnapshot failed and started to election. After two hours, it succeed to install snapshot, but failed to updateLastAppliedTermIndex. After that, it repeated 'receive installSnapshot and installSnapshot failed' for several hours until I restarted the server.
24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: Failed appendEntries as the first entry (index 34795876) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
24/08/11 09:03:13,714 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: inconsistency entries. Reply:3<-1#2559343:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: Failed appendEntries as the first entry (index 34795875) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
24/08/11 09:03:13,715 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: inconsistency entries. Reply:3<-1#2559406:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
....(repeat 'Failed appendEntries')
24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: Failed appendEntries as the first entry (index 34465382) already exists (snapshotIndex: 34670809, commitIndex: 34795893)
24/08/11 09:15:41,827 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: inconsistency entries. Reply:3<-1#2892557:FAIL-t59,INCONSISTENCY,nextIndex=34795894,followerCommit=34795893,matchIndex=-1
24/08/11 09:15:42,230 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:bbe49073-5dad-4499-9051-58a0e53b0658,0
24/08/11 09:15:42,231 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 34795893, last included index in snapshot is 34670057
24/08/11 09:15:42,233 WARN [nioEventLoopGroup-3-3] DefaultChannelPipeline: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 34795893, last included index in snapshot is 34670057
24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] FollowerState: 1@group-47BEDE733167-FollowerState: change to CANDIDATE, lastRpcElapsedTime:7874610911ns, electionTimeout:3353ms
24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
24/08/11 09:15:50,105 INFO [1@group-47BEDE733167-FollowerState] RaftServer$Division: 1@group-47BEDE733167: changes role from FOLLOWER to CANDIDATE at term 59 for changeToCandidate
24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RaftServerConfigKeys: raft.server.leaderelection.pre-vote = true (default)
24/08/11 09:15:50,106 INFO [1@group-47BEDE733167-FollowerState] RoleInfo: 1: start 1@group-47BEDE733167-LeaderElection5
24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] RaftServer$Division: 1@group-47BEDE733167: change Leader from 3 to null at term 59 for PRE_VOTE
24/08/11 09:15:50,107 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: submit vote requests at term 59 for 34233595: peers:[1|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, 2|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER, 3|rpc:xxx:9862|admin:|client:xxx:9087|dataStream:|priority:0|startupRole:FOLLOWER]|listeners:[], old=null
24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: 1@group-47BEDE733167-LeaderElection5: PRE_VOTE REJECTED received 2 response(s) and 0 exception(s):
24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: Response 0: 1<-2#0:FAIL-t59
24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: Response 1: 1<-3#0:FAIL-t59
24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] LeaderElection: 1@group-47BEDE733167-LeaderElection5 PRE_VOTE round 0: result REJECTED
24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RaftServer$Division: 1@group-47BEDE733167: changes role from CANDIDATE to FOLLOWER at term 59 for REJECTED
24/08/11 09:15:50,110 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 1: shutdown 1@group-47BEDE733167-LeaderElection5
24/08/11 09:15:50,111 INFO [1@group-47BEDE733167-LeaderElection5] RoleInfo: 1: start 1@group-47BEDE733167-FollowerState
24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:802b8106-076b-4ffc-8cf1-1ace04952c7a,0
24/08/11 09:15:52,075 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: change Leader from null to 3 at term 59 for installSnapshot, leader elected after 1967ms
24/08/11 09:15:52,075 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
java.lang.IllegalStateException: 1@group-47BEDE733167 log's commit index is 34795893, last included index in snapshot is 34670057
(repeat 'election, fail, install snapshot fail' for more than 2 hours)
(still election fail, but this time install snapshot successfully, but failed to updateLastAppliedTermIndex)
24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0
24/08/11 11:52:38,168 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 1@group-47BEDE733167: change Leader from null to 3 at term 59 for installSnapshot, leader elected after 3610ms
24/08/11 11:52:39,434 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installing snapshot:3->1#0-t59,chunk:dcc73f7c-c17d-44e5-96b6-f6f1570e68a7,0, to tmp dir:/data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7
24/08/11 11:52:39,455 INFO [nioEventLoopGroup-3-3] SnapshotManager: Installed snapshot, renaming temporary dir /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/snapshot-dcc73f7c-c17d-44e5-96b6-f6f1570e68a7 to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm
24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: Reinitializing state machine.
24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] RaftLog: 1@group-47BEDE733167-SegmentedRaftLog: snapshotIndex: updateIncreasingly 33870802 -> 34870063
24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachine: obsolete snapshot provided: (t:59, i:34670809)
24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: flushIndex: setUnconditionally 34795893 -> 34870063
24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: safeCacheEvictIndex: setUnconditionally 34783743 -> 34870063
24/08/11 11:52:40,829 INFO [1@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater: appliedIndex: setUnconditionally 34795893 -> 34670809
24/08/11 11:52:40,829 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Closing segment log_inprogress_34783744 to index: 34795893
24/08/11 11:52:40,837 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: successfully install the entire snapshot-34870063
24/08/11 11:52:40,877 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: reply installSnapshot: 3<-1#0:OK-t59,SUCCESS,requestIndex=0
24/08/11 11:52:40,902 INFO [1@group-47BEDE733167-SegmentedRaftLogWorker] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker: Rolled log segment from /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_inprogress_34783744 to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/current/log_34783744-34795893
24/08/11 11:52:41,160 ERROR [1@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 1@group-47BEDE733167-StateMachineUpdater caught a Throwable.
java.lang.IllegalStateException: 1: Failed updateLastAppliedTermIndex: newTI = (t:59, i:34670810) < oldTI = (t:59, i:34795893)
24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] RaftServer$Division: 1@group-47BEDE733167: shutdown
24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] JmxRegister: Successfully un-registered JMX Bean with object name Ratis:service=RaftServer,group=group-47BEDE733167,id=1
24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] RoleInfo: 1: shutdown 1@group-47BEDE733167-FollowerState
24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-FollowerState] FollowerState: 1@group-47BEDE733167-FollowerState was interrupted
24/08/11 11:52:41,161 INFO [1@group-47BEDE733167-StateMachineUpdater] RaftServer$Division: 1@group-47BEDE733167: closes. applyIndex: 34670810
24/08/11 11:52:42,340 INFO [1@group-47BEDE733167-StateMachineUpdater] SegmentedRaftLogWorker: 1@group-47BEDE733167-SegmentedRaftLogWorker close()
24/08/11 11:52:48,170 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:7179265b-d007-4835-9b5f-fa9e4d853b72,0
24/08/11 11:52:48,170 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
org.apache.ratis.protocol.exceptions.ServerNotReadyException: 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
24/08/11 11:52:58,171 INFO [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: receive installSnapshot: 3->1#0-t59,chunk:7f56f35f-20f3-4b42-bdb6-c2d52db646fc,0
24/08/11 11:52:58,171 ERROR [nioEventLoopGroup-3-3] SnapshotInstallationHandler: 1@group-47BEDE733167: installSnapshot failed
org.apache.ratis.protocol.exceptions.ServerNotReadyException: 1@group-47BEDE733167 is not in [STARTING, RUNNING]: current state is CLOSED
(repeat 'receive installSnapshot and installSnapshot failed' for serval hours until I restarted the server)
Here is what I found in the leader's log
24/08/11 09:03:10,130 WARN [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] LogAppender: 3@group-47BEDE733167->1-LogAppenderDefault: Failed to appendEntries (retry=1): org.apache.ratis.protocol.exceptions.TimeoutIOException
24/08/11 09:03:13,714 INFO [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 34795876 -> 34795875
24/08/11 09:03:13,715 INFO [3@group-47BEDE733167->1-LogAppenderDefault-LogAppenderDaemon] FollowerInfo: 3@group-47BEDE733167->1: decreaseNextIndex nextIndex: updateUnconditionally 34795875 -> 34795874
I guess when the leader called appendEntries for the first time, the follower had been executed successfully, but the leader did not receive the follower's response and times out. When it resent, it found that it could not match the change and started to decreaseNextIndex.
Sometimes it can be fixed automatically by rolling segment log, but not always.
24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 2@group-47BEDE733167: Failed appendEntries as the first entry (index 35049128) already exists (snapshotIndex: 35070063, commitIndex: 35259248)
24/08/12 11:47:53,351 INFO [nioEventLoopGroup-3-3] RaftServer$Division: 2@group-47BEDE733167: inconsistency entries. Reply:3<-2#3583083:FAIL-t59,INCONSISTENCY,nextIndex=35259249,followerCommit=35259248,matchIndex=-1
24/08/12 11:47:55,132 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment log-35250411_35261308 to index:35261308
24/08/12 11:54:46,402 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Taking a snapshot to file /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat.
24/08/12 11:54:46,403 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Saving digest 272e2c87a8e62785588ab383c0b8c290 for snapshot file /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
24/08/12 11:54:46,458 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachine: Renaming a snapshot file /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/tmp/raft_snapshot_1723434886330_4407392949522239377.dat to /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_35270066.
24/08/12 11:54:46,487 INFO [2@group-47BEDE733167-StateMachineUpdater] SimpleStateMachineStorage: Deleting old snapshot at /data0/celeborn/ha_ratis/c5196f6d-2c34-3ed3-8b8a-47bede733167/sm/snapshot.59_34670063
24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: Took a snapshot at index 35270066
24/08/12 11:54:46,530 INFO [2@group-47BEDE733167-StateMachineUpdater] StateMachineUpdater: 2@group-47BEDE733167-StateMachineUpdater: snapshotIndex: updateIncreasingly 35070063 -> 35270066
24/08/12 11:57:16,592 INFO [nioEventLoopGroup-3-3] SegmentedRaftLogWorker: 2@group-47BEDE733167-SegmentedRaftLogWorker: Rolling segment log-35261309_35271592 to index:35271592
Attachments
Attachments
Issue Links
- links to