Uploaded image for project: 'Ratis'
  1. Ratis
  2. RATIS-1305

Leader stuck in infinite install snapshot cycle when logs have been purged

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Blocker
    • Resolution: Fixed
    • None
    • 2.0.0
    • server
    • None

    Description

      After logs have been purged from the leader and followers, the leader repeatedly attempts to send snapshots to the followers, who reject them because there have not been any new transactions to apply. The leader continues to send snapshots infinitely.

      Here is an example of the log messages. om1 is the leader, om2 and om3 are followers.

      On the leader om1:

      om1_1 | 2021-02-02 17:17:23,261 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
      om1_1 | 2021-02-02 17:17:23,272 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: followerNextIndex = 337 but logStartIndex = -1, notify follower to install snapshot-(t:1, i:337)
      om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om3-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-GrpcLogAppender: send om1->om3#0-t1,notify:(t:1, i:337)
      om1_1 | 2021-02-02 17:17:23,286 [om1@group-D66704EFC61C->om2-GrpcLogAppender-LogAppenderDaemon] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-GrpcLogAppender: send om1->om2#0-t1,notify:(t:1, i:337)
      om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: received a reply om1<-om3#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
      om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om3-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.
      om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: snapshotIndex: setUnconditionally 0 -> 336
      om1_1 | 2021-02-02 17:17:23,522 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: matchIndex: setUnconditionally 336 -> 336
      om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: setUnconditionally 337 -> 337
      om1_1 | 2021-02-02 17:17:23,523 [grpc-default-executor-1] INFO leader.FollowerInfo: om1@group-D66704EFC61C->om3: nextIndex: updateToMax old=337, new=337, updated? false
      om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: received a reply om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
      om1_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-1] INFO server.GrpcLogAppender: om1@group-D66704EFC61C->om2-InstallSnapshotResponseHandler: Already Installed Snapshot Index 336.
      
      

       

      On follower om2:

      om2_1 | 2021-02-02 17:17:23,306 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: receive installSnapshot: om1->om2#0-t1,notify:(t:1, i:337)
      om2_1 | 2021-02-02 17:17:23,312 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: StateMachine snapshotIndex is 336
      om2_1 | 2021-02-02 17:17:23,560 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set new configuration configurationEntry {
      om2_1 | peers {
      om2_1 | id: "om1"
      om2_1 | address: "om1:9872"
      om2_1 | }
      om2_1 | peers {
      om2_1 | id: "om3"
      om2_1 | address: "om3:9872"
      om2_1 | }
      om2_1 | peers {
      om2_1 | id: "om2"
      om2_1 | address: "om2:9872"
      om2_1 | }
      om2_1 | }
      om2_1 | from snapshot
      om2_1 | 2021-02-02 17:17:23,561 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: set configuration 0: [om1|rpc:om1:9872|dataStream:|priority:0, om3|rpc:om3:9872|dataStream:|priority:0, om2|rpc:om2:9872|dataStream:|priority:0], old=null
      om2_1 | 2021-02-02 17:17:23,567 [grpc-default-executor-0] INFO server.RaftServer$Division: om2@group-D66704EFC61C: reply installSnapshot: om1<-om2#0:FAIL-t1,ALREADY_INSTALLED,snapshotIndex=336
      om2_1 | 2021-02-02 17:17:23,570 [grpc-default-executor-0] INFO server.GrpcServerProtocolService: om2: Completed INSTALL_SNAPSHOT, lastRequest: om1->om2#0-t1,notify:(t:1, i:337)
      
      

       

      These log messages are repeated forever until the cluster is terminated. The term and index numbers do not change.

       

      EDIT: I think the cluster not responding was caused by something different. This behavior is still a bug, however.

       

      Attachments

        Issue Links

          Activity

            People

              erose Ethan Rose
              erose Ethan Rose
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 2h 50m
                  2h 50m