Description
The symptom is that, in an Ozone cluster, OM fails to install the snapshot. From the OM log, the OM state machine has done its part (e.g. downloading the checkpoint, installing it, and loading it).
First,
stateMachine.followerEvent().notifyInstallSnapshotFromLeader(roleInfoProto, firstAvailableLogTermIndex).whenComplete(...)
This is an asynchronous CompletableFuture action. Normally, the follower should receive a further InstallSnapshot request and report back once it has installed the snapshot. But I found that the leader does not send any more InstallSnapshot requests.
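To illustrate the asynchrony, here is a minimal standalone sketch (not Ratis code; the class and variable names below are made up for the demo): the whenComplete callback only fires once the state machine finishes installing the snapshot, on whichever thread completes the future, while the server thread that attached it returns immediately and keeps answering the leader.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Minimal sketch of the async pattern; not Ratis code.
public class AsyncNotifyDemo {
  public static void main(String[] args) throws Exception {
    // Simulates the state machine downloading and loading the checkpoint.
    CompletableFuture<Long> installFuture = CompletableFuture.supplyAsync(() -> {
      sleepQuietly(200);
      return 1234L;                        // index of the installed snapshot
    });

    // The callback only fires once installation finishes, possibly much later.
    installFuture.whenComplete((index, err) ->
        System.out.println("whenComplete: installed snapshot index = " + index));

    // The server thread does not wait for the callback; it keeps replying.
    System.out.println("reply to leader: installation still in progress");

    installFuture.get(5, TimeUnit.SECONDS); // keep the demo alive until completion
    Thread.sleep(50);                       // give the callback time to print
  }

  private static void sleepQuietly(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}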
During the whenComplete stage, the following calls are executed, which update the snapshot index and the commit index:
stateMachine.pause();
state.updateInstalledSnapshotIndex(reply);
state.reloadStateMachine(reply.getIndex());
installedSnapshotIndex.set(reply.getIndex());
While appendEntriesAsync is being processed, checkInconsistentAppendEntries returns an inconsistency because the snapshot installation is still in progress. Once the snapshot index and commit index are actually updated, the leader receives an inconsistency reply carrying the new index and stops sending InstallSnapshot requests, because shouldNotifyToInstallSnapshot() now returns null.
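To make the leader-side effect concrete, here is a rough sketch; the class, fields, and method body are invented for illustration and are not the actual Ratis implementation (only the name shouldNotifyToInstallSnapshot() comes from the description above). Once the inconsistency reply carries the follower's advanced next index, the check no longer yields an index to notify, so the leader stops sending InstallSnapshot requests.

// Invented, simplified sketch of the leader-side check; not the Ratis implementation.
final class LeaderSideSketch {
  private long followerNextIndex;   // updated from the follower's (inconsistent) appendEntries replies
  private long leaderStartIndex;    // first index the leader still keeps in its log

  // Returns the snapshot index to notify about, or null when no notification is needed.
  Long shouldNotifyToInstallSnapshot() {
    // After the follower's whenComplete callback has bumped its indices, the
    // inconsistency reply reports the new next index, this returns null, and
    // the leader never sends another InstallSnapshot request.
    return followerNextIndex < leaderStartIndex ? Long.valueOf(leaderStartIndex - 1) : null;
  }
}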
Meanwhile, because of the asynchronous CompletableFuture action, the follower Raft server has not yet reported SNAPSHOT_INSTALLED to the leader in response to the previous InstallSnapshot request, and it will not receive any further requests. This leads to an infinite loop of failed appendEntries while the install snapshot progress disappears.
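The follower side of the deadlock, again as an invented sketch rather than Ratis code: SNAPSHOT_INSTALLED can only be returned as the reply to a later InstallSnapshot notification, so once the leader stops notifying, the asynchronously recorded installed index is never acknowledged.

import java.util.concurrent.atomic.AtomicLong;

// Invented, simplified follower-side sketch; not the Ratis implementation.
final class FollowerSideSketch {
  enum Result { IN_PROGRESS, SNAPSHOT_INSTALLED }

  private final AtomicLong installedSnapshotIndex = new AtomicLong(-1);

  // Called asynchronously from the whenComplete callback, possibly long after
  // the last InstallSnapshot notification was answered with IN_PROGRESS.
  void onSnapshotInstalled(long index) {
    installedSnapshotIndex.set(index);
  }

  // Only ever invoked when an InstallSnapshot notification arrives from the
  // leader; if the leader has already stopped notifying, SNAPSHOT_INSTALLED
  // is never reported and both sides wait on each other forever.
  Result handleInstallSnapshotNotification() {
    return installedSnapshotIndex.get() >= 0 ? Result.SNAPSHOT_INSTALLED : Result.IN_PROGRESS;
  }
}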
Issue Links
- relates to RATIS-1577 Install snapshot failure (Resolved)