IGNITE-20425

Corrupted Raft FSM state after restart


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: None

    Description

      According to the protocol, there are several numeric indexes in the Log / FSM:

      • lastLogIndex - index of the last entry written to the log.
      • committedIndex - index of the last committed log entry. committedIndex <= lastLogIndex.
      • appliedIndex - index of the last log entry processed by the state machine. appliedIndex <= committedIndex.
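
      For illustration, a minimal sketch of these indexes and the invariant the protocol expects (hypothetical class, not Ignite's actual API):

      // Minimal sketch of the Raft log / FSM indexes; names are illustrative.
      class RaftIndexes {
          long lastLogIndex;   // index of the last entry written to the log
          long committedIndex; // index of the last committed entry
          long appliedIndex;   // index of the last entry applied to the state machine

          // The protocol expects: appliedIndex <= committedIndex <= lastLogIndex.
          boolean invariantHolds() {
              return appliedIndex <= committedIndex && committedIndex <= lastLogIndex;
          }
      }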

      If the committed index is less than the last index, Raft can invoke the "truncate suffix" procedure and delete the uncommitted tail of the log. This is a valid thing to do.
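
      As a rough sketch (illustrative signature only, not necessarily jraft's exact API), the operation looks like this:

      // "Truncate suffix" removes the uncommitted tail of the log,
      // i.e. all entries with index greater than lastIndexKept.
      interface LogStorageSketch {
          void truncateSuffix(long lastIndexKept);
      }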

      Now, imagine the following scenario:

      • lastLogIndex == 12, committedIndex == 11
      • Node is restarted
      • Upon recovery, we replay the entire log. Now appliedIndex == 12
      • After recovery, we join the group and receive a "truncate suffix" command in order to delete uncommitted entries.
      • We must delete entry 12, but it's already applied. The peer is broken.
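
      Reusing the hypothetical RaftIndexes sketch from above, the broken sequence can be spelled out like this (illustrative only, not Ignite code):

      public class ReplayBeforeTruncate {
          public static void main(String[] args) {
              RaftIndexes idx = new RaftIndexes();
              idx.lastLogIndex = 12;
              idx.committedIndex = 11;

              // Restart: the node replays the entire local log, so appliedIndex becomes 12.
              idx.appliedIndex = idx.lastLogIndex;

              // appliedIndex <= committedIndex is already violated at this point.
              System.out.println("invariant holds: " + idx.invariantHolds()); // false

              // The new leader asks to truncate the uncommitted suffix (entries after 11).
              long lastIndexKept = idx.committedIndex; // == 11

              // Entry 12 is already applied and cannot be undone: the peer is broken.
              System.out.println("applied entry would be truncated: " + (idx.appliedIndex > lastIndexKept)); // true
          }
      }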

      The reason is that we don't use the default recovery procedure: org.apache.ignite.raft.jraft.core.NodeImpl#init

      Canonical Raft doesn't replay the log before the join is complete.

      A down-to-earth scenario that shows this situation in practice:

      • Start a group with 3 nodes: A, B, and C.
      • We assume that A is the leader.
      • Shut down A; leader re-election is triggered.
      • We assume that B votes for C.
      • C receives the grant from B and proceeds to write the new configuration into its local log.
      • Shut down B before it writes the same log entry (an easily reproducible race).
      • Shut down C.
      • Restart the cluster.

      Resulting states:

      A - [1: initial cfg]

      B - [1: initial cfg]

      C - [1: initial cfg, 2: re-election]
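
      For illustration, the same per-node state as a tiny snippet (hypothetical, not Ignite code):

      import java.util.List;
      import java.util.Map;

      public class ResultingLogs {
          public static void main(String[] args) {
              Map<String, List<String>> logs = Map.of(
                      "A", List.of("1: initial cfg"),
                      "B", List.of("1: initial cfg"),
                      "C", List.of("1: initial cfg", "2: re-election"));

              // Entry 2 exists only on C and was never replicated to a majority, so it is uncommitted.
              // If C replays its whole log on start, entry 2 gets applied; if A or B then wins the
              // election, C is asked to truncate entry 2, which it has already applied: the peer is broken.
              logs.forEach((node, log) -> System.out.println(node + " - " + log));
          }
      }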

      How to fix

      option a. Recover the log after join. This is not optimal; it's like performing local recovery after cluster activation in Ignite 2, a behavior we fixed a long time ago.

      option b. Somehow track the committed index and perform a partial recovery that guarantees safety. We could write the committed index into the log storage periodically.

      "b" is better, but maybe there are other ways as well.

      Upd #1

      Most likely, we can simply remove all of that "await log replay" code on Raft node start, because it is no longer needed. It was originally introduced to enable primary replica direct storage reads, which is now covered properly within:

      /**
       * Tries to read the index from the group leader and waits for this index to appear in the local storage. Can possibly return a failed
       * future with a timeout exception; in this case, the replica will not answer the placement driver, because the response is useless.
       * The placement driver should handle this.
       *
       * @param expirationTime Lease expiration time.
       * @return Future that is completed when the local storage catches up to the index that is actual for the leader at the moment of the request.
       */
      private CompletableFuture<Void> waitForActualState(long expirationTime) {
          LOG.info("Waiting for actual storage state, group=" + groupId());
      
          long timeout = expirationTime - currentTimeMillis();
          if (timeout <= 0) {
              return failedFuture(new TimeoutException());
          }
      
          return retryOperationUntilSuccess(raftClient::readIndex, e -> currentTimeMillis() > expirationTime, executor)
                  .orTimeout(timeout, TimeUnit.MILLISECONDS)
                  .thenCompose(storageIndexTracker::waitFor);
      }

      Something similar applies to RO access: we await the safeTime that has a happens-before (HB) relation with the corresponding storage updates.
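
      For completeness, a rough sketch of that RO idea (hypothetical names, not the real Ignite API): before serving a read-only request at some timestamp, wait until the locally observed safeTime has reached it.

      import java.util.concurrent.CompletableFuture;
      import java.util.function.Supplier;

      public class RoReadSketch {
          interface SafeTimeTracker {
              // Completes when the locally observed safe time is >= the given timestamp.
              CompletableFuture<Void> waitFor(long timestamp);
          }

          // Waiting for safeTime >= readTimestamp gives the happens-before relation with
          // the storage updates that the read is allowed to observe.
          static <T> CompletableFuture<T> readOnly(SafeTimeTracker safeTime, long readTimestamp, Supplier<T> readFromLocalStorage) {
              return safeTime.waitFor(readTimestamp).thenApply(v -> readFromLocalStorage.get());
          }
      }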


People

  alapin Alexander Lapin
  ibessonov Ivan Bessonov
  Vladislav Pyatkov
