  1. Ignite
  2. IGNITE-18428

After a RAFT snapshot install timed out, subsequent installs consistently failed


Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Duplicate
    • None
    • 3.0.0-beta2
    • None

    Description

      If a RAFT snapshot installation takes longer than the corresponding timeout (10 seconds in this case), a retry is attempted. If the retry finds an ongoing snapshot copier, it tries to cancel it so that the installation starts over on the next retry.

      In one test run, the initial attempt to install a snapshot failed, but every subsequent attempt only tried to cancel the installation and none of them actually started another copier, so the retries looped indefinitely.

      Normally, onSnapshotLoadDone() is invoked even if the snapshot load has failed, in order to clean everything up and make the next install attempt possible. This cleanup includes nullifying the contents of downloadingSnapshot in SnapshotExecutorImpl. This time, however, according to the log, onSnapshotLoadDone() was never invoked, so the old snapshot remained 'downloading' forever.
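
      The stuck state described above can be reproduced in a toy model. The sketch below is not the real SnapshotExecutorImpl: the class InstallGuardSketch, the method tryInstall() and the counters are hypothetical names, and the slot is modelled as a plain string. It only assumes what the description states, namely that a non-null downloadingSnapshot makes a retry cancel instead of starting a copier, and that onSnapshotLoadDone() is what clears it.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified model of the install guard; not the real SnapshotExecutorImpl. */
class InstallGuardSketch {
    /** Stands in for downloadingSnapshot: non-null while an install is considered in progress. */
    private final AtomicReference<String> downloadingSnapshot = new AtomicReference<>();

    private int copiersStarted;
    private int cancelRequests;

    /** One install attempt: start a copier if the slot is free, otherwise only request a cancel. */
    boolean tryInstall(String snapshotId) {
        if (downloadingSnapshot.compareAndSet(null, snapshotId)) {
            copiersStarted++;
            return true;
        }
        cancelRequests++;
        return false;
    }

    /** Cleanup step: if it is skipped after a failure, the slot is never released. */
    void onSnapshotLoadDone() {
        downloadingSnapshot.set(null);
    }

    int copiersStarted() {
        return copiersStarted;
    }

    int cancelRequests() {
        return cancelRequests;
    }

    public static void main(String[] args) {
        InstallGuardSketch stuck = new InstallGuardSketch();
        stuck.tryInstall("snapshot-1"); // initial attempt; assume it times out without cleanup
        for (int i = 0; i < 3; i++) {
            stuck.tryInstall("snapshot-1"); // each retry only cancels, no new copier is started
        }
        System.out.println(stuck.copiersStarted() + " copier(s), " + stuck.cancelRequests() + " cancel(s)");

        InstallGuardSketch healthy = new InstallGuardSketch();
        healthy.tryInstall("snapshot-1");
        healthy.onSnapshotLoadDone(); // cleanup runs even though the load failed
        System.out.println("retry started a copier: " + healthy.tryInstall("snapshot-1"));
    }
}
```

      In this model the first instance starts exactly one copier no matter how many retries follow, which matches the looping behaviour seen in the log; the second instance shows that a single cleanup call is enough for the next attempt to proceed.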

      This could have something to do with the fact that IncomingSnapshotCopier does not set its status to error (with setError()) on cancellation, as LocalSnapshotCopier does.
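
      As an illustration of that hypothesis only, here is a minimal sketch of a copier whose cancel() both records an error status and guarantees that the completion callback runs exactly once. CancellableCopierSketch and all of its members are hypothetical; the sketch does not reproduce the actual IncomingSnapshotCopier or LocalSnapshotCopier code, and whether a missing setError() is really the cause is not established by this ticket.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical copier; names loosely follow the ticket, not the real JRaft or Ignite classes. */
class CancellableCopierSketch {
    /** Stands in for the completion path that would end in onSnapshotLoadDone(). */
    private final Runnable onDone;

    private final AtomicBoolean finished = new AtomicBoolean();

    /** null means the copier has not recorded an error. */
    private volatile String error;

    CancellableCopierSketch(Runnable onDone) {
        this.onDone = onDone;
    }

    /** Cancels the copy: records an error status and triggers completion, unless already finished. */
    void cancel() {
        if (finished.compareAndSet(false, true)) {
            error = "cancelled";
            onDone.run();
        }
    }

    /** Normal completion: triggers the callback, unless the copier was already cancelled. */
    void complete() {
        if (finished.compareAndSet(false, true)) {
            onDone.run();
        }
    }

    boolean isOk() {
        return error == null;
    }
}
```

      The compareAndSet guard is a design choice of the sketch: whichever of cancel() and complete() arrives first decides the final status, so a late call from the other side cannot run the cleanup twice or overwrite the status.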

      There could also be a race condition involved.

      Attachments

        1. test.log.txt
          2.39 MB
          Roman Puchkovskiy

            People

              rpuch Roman Puchkovskiy