Description
After the Raft log has been purged on the leader and the followers, the leader repeatedly notifies the followers to install a snapshot. The followers reject every notification with ALREADY_INSTALLED because there are no new transactions to apply, yet the leader keeps sending the same notification indefinitely.
Here is an example of the log messages. om1 is the leader, om2 and om3 are followers.
On the leader om1:
om1_1 | 2021-02-02 17:17:23,261 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,272 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send om1->om3#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send om1->om2#0-t1,notify:(t:1, i:337)
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex: setUnconditionally 0 -> 336
om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex: setUnconditionally 336 -> 336
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally 337 -> 337
om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax old=337, new=337, updated? false
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.
On follower om2:
om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot: om1->om2#0-t1,notify:(t:1, i:337)
om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex is 336
om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration configurationEntry {
om2_1 | peers {
om2_1 | id: "om1"
om2_1 | address: "om1:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om3"
om2_1 | address: "om3:9872"
om2_1 | }
om2_1 | peers {
om2_1 | id: "om2"
om2_1 | address: "om2:9872"
om2_1 | }
om2_1 | }
om2_1 | from snapshot
om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0: [om1|rpc:om1:9872|dataStream:|priority:0, om3|rpc:om3:9872|dataStream:|priority:0, om2|rpc:om2:9872|dataStream:|priority:0], old=null
om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot: om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om1->om2#0-t1,notify:(t:1, i:337)
These log messages are repeated forever until the cluster is terminated. The term and index numbers do not change.
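To make the cycle explicit, here is a minimal, self-contained sketch in plain Java (not Ratis code) of the leader/follower exchange as reconstructed from the log messages above. The class name and the simplified checks are assumptions for illustration only; the point is that the follower's ALREADY_INSTALLED reply sets the leader's nextIndex back to the value it already had, so the same notification is sent on every round.

// Simplified sketch of the stuck install-snapshot loop; not the actual Ratis implementation.
public class SnapshotNotifyLoopSketch {
  public static void main(String[] args) {
    final long logStartIndex = -1;           // leader's log is fully purged
    final long followerSnapshotIndex = 336;  // follower's latest installed snapshot
    long followerNextIndex = 337;            // next index the leader wants to replicate

    // Three rounds are enough to show that the state never changes.
    for (int round = 1; round <= 3; round++) {
      // Leader: there is no log entry at followerNextIndex (the log was purged),
      // so it falls back to an install-snapshot notification.
      System.out.printf("round %d: followerNextIndex = %d but logStartIndex = %d,"
          + " notify follower to install snapshot-(t:1, i:%d)%n",
          round, followerNextIndex, logStartIndex, followerNextIndex);

      // Follower: it already has a snapshot at index 336 and nothing newer exists,
      // so it rejects the notification with ALREADY_INSTALLED.
      System.out.printf("round %d: reply FAIL-ALREADY_INSTALLED, snapshotIndex=%d%n",
          round, followerSnapshotIndex);

      // Leader: from the reply it resets nextIndex to snapshotIndex + 1 = 337,
      // the value it already had, so the next heartbeat repeats the same cycle.
      followerNextIndex = followerSnapshotIndex + 1;
    }
  }
}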
EDIT: I now think the cluster becoming unresponsive was caused by something else. The behavior described above is still a bug, however.
Issue Links
- is required by HDDS-3698 Ozone Non-Rolling upgrades (Resolved)