Details
- Bug
- Status: Resolved
- Major
- Resolution: Duplicate
- None
- None
Description
If a RAFT snapshot installation takes longer than the corresponding timeout (10 seconds in this case), a retry is attempted. If the retry finds an ongoing snapshot copier, it tries to cancel it so that the installation starts over on the next retry.
In one test run, the initial attempt to install a snapshot failed, but every subsequent attempt only tried to cancel the installation and none of them actually started another copier, resulting in an infinite loop.
Normally, onSnapshotLoadDone() is invoked even if the snapshot load has failed, in order to clean everything up and make the next install attempt possible. This cleanup includes nullifying downloadingSnapshot in SnapshotExecutorImpl. This time, however, according to the log, onSnapshotLoadDone() was never invoked, so the old snapshot remained 'downloading' forever.
This could have something to do with the fact that IncomingSnapshotCopier does not set its status to error (with setError()) on cancellation, as LocalSnapshotCopier does.
There could also be a race condition.
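The stuck state described above can be illustrated with a minimal Java sketch. This is a toy model, not the actual Ignite code: the classes Copier and Executor, the flag reportsErrorOnCancel and the method copiersStartedAfterRetries are hypothetical names introduced only for illustration. It assumes that the executor clears downloadingSnapshot only from the copier's completion callback, and it models only the hypothesis from this description (a cancelled copier that never reports completion), not a confirmed root cause.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the retry loop; all names are hypothetical stand-ins. */
class SnapshotRetrySketch {
    /** Stand-in for a snapshot copier that never finishes on its own. */
    static class Copier {
        private final boolean reportsErrorOnCancel;
        private Runnable onDone = () -> { };

        Copier(boolean reportsErrorOnCancel) {
            this.reportsErrorOnCancel = reportsErrorOnCancel;
        }

        void setOnDone(Runnable onDone) {
            this.onDone = onDone;
        }

        /** Cancels the copy; completion is reported only if the flag is set. */
        void cancel() {
            if (reportsErrorOnCancel) {
                onDone.run();
            }
        }
    }

    /** Stand-in for the snapshot executor. */
    static class Executor {
        private final AtomicReference<Copier> downloadingSnapshot = new AtomicReference<>();
        private final boolean reportsErrorOnCancel;
        int copiersStarted;

        Executor(boolean reportsErrorOnCancel) {
            this.reportsErrorOnCancel = reportsErrorOnCancel;
        }

        /** One install attempt: cancels an ongoing copier, or starts a new one. */
        boolean installSnapshot() {
            Copier current = downloadingSnapshot.get();
            if (current != null) {
                // An earlier install is still registered: cancel it and wait for the next retry.
                current.cancel();
                return false;
            }
            Copier copier = new Copier(reportsErrorOnCancel);
            // Cleanup happens only in the completion callback (assumption of this sketch).
            copier.setOnDone(() -> downloadingSnapshot.compareAndSet(copier, null));
            downloadingSnapshot.set(copier);
            copiersStarted++;
            return true;
        }
    }

    /** Runs one initial attempt plus the given number of retries; returns how many copiers were started. */
    static int copiersStartedAfterRetries(boolean reportsErrorOnCancel, int retries) {
        Executor executor = new Executor(reportsErrorOnCancel);
        executor.installSnapshot(); // initial attempt; the copier hangs and the timeout fires
        for (int i = 0; i < retries; i++) {
            executor.installSnapshot();
        }
        return executor.copiersStarted;
    }

    public static void main(String[] args) {
        System.out.println(copiersStartedAfterRetries(false, 5)); // cancellation never reports completion
        System.out.println(copiersStartedAfterRetries(true, 5));  // cancellation reports completion
    }
}
```

In this model, when cancellation does not trigger the completion callback, every retry only cancels the stale copier and no new copier is ever started. When cancellation does trigger it, the reference is cleared and a later retry starts a fresh copier.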
Attachments
Issue Links
- duplicates
- IGNITE-18495 Fix RAFT snapshot installation hang due to response swap on retry
- Resolved
- relates to
- IGNITE-18079 Integrate RAFT streaming snapshots
- Resolved