IGNITE-20425

Corrupted Raft FSM state after restart


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: None

    Description

      According to the protocol, there are several numeric indexes in the Log / FSM:

      • lastLogIndex - index of the last entry written to the log.
      • committedIndex - index of the last committed log entry. committedIndex <= lastLogIndex.
      • appliedIndex - index of the last log entry processed by the state machine. appliedIndex <= committedIndex.
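
      For illustration, a minimal sketch of these indexes and the invariant the protocol expects (hypothetical class, not Ignite's actual API):

      // Minimal sketch of the Raft log / FSM indexes; names are illustrative.
      class RaftIndexes {
          long lastLogIndex;   // index of the last entry written to the log
          long committedIndex; // index of the last committed entry
          long appliedIndex;   // index of the last entry applied to the state machine

          // The protocol expects: appliedIndex <= committedIndex <= lastLogIndex.
          boolean invariantHolds() {
              return appliedIndex <= committedIndex && committedIndex <= lastLogIndex;
          }
      }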

      If the committed index is less than the last index, Raft can invoke the "truncate suffix" procedure and delete the uncommitted tail of the log. This is a valid thing to do.
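
      As a rough sketch (illustrative signature only, not necessarily jraft's exact API), the operation looks like this:

      // "Truncate suffix" removes the uncommitted tail of the log,
      // i.e. all entries with index greater than lastIndexKept.
      interface LogStorageSketch {
          void truncateSuffix(long lastIndexKept);
      }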

      Now, imagine the following scenario:

      • lastLogIndex == 12, committedIndex == 11
      • Node is restarted
      • Upon recovery, we replay the entire log. Now appliedIndex == 12
      • After recovery, we join the group and receive a "truncate suffix" command in order to delete uncommitted entries.
      • We must delete entry 12, but it's already applied. The peer is broken.
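
      Reusing the hypothetical RaftIndexes sketch from above, the broken sequence can be spelled out like this (illustrative only, not Ignite code):

      public class ReplayBeforeTruncate {
          public static void main(String[] args) {
              RaftIndexes idx = new RaftIndexes();
              idx.lastLogIndex = 12;
              idx.committedIndex = 11;

              // Restart: the node replays the entire local log, so appliedIndex becomes 12.
              idx.appliedIndex = idx.lastLogIndex;

              // appliedIndex <= committedIndex is already violated at this point.
              System.out.println("invariant holds: " + idx.invariantHolds()); // false

              // The new leader asks to truncate the uncommitted suffix (entries after 11).
              long lastIndexKept = idx.committedIndex; // == 11

              // Entry 12 is already applied and cannot be undone: the peer is broken.
              System.out.println("applied entry would be truncated: " + (idx.appliedIndex > lastIndexKept)); // true
          }
      }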

      The reason is that we don't use the default recovery procedure: org.apache.ignite.raft.jraft.core.NodeImpl#init

      Canonical Raft doesn't replay the log before the join is complete.

      A down-to-earth scenario that shows this situation in practice:

      • Start a group with 3 nodes: A, B, and C.
      • We assume that A is the leader.
      • Shut down A; leader re-election is triggered.
      • We assume that B votes for C.
      • C receives the grant from B and proceeds to write the new configuration into its local log.
      • Shut down B before it writes the same log entry (an easily reproducible race).
      • Shut down C.
      • Restart the cluster.

      Resulting states:

      A - [1: initial cfg]

      B - [1: initial cfg]

      C - [1: initial cfg, 2: re-election]
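
      For illustration, the same per-node state as a tiny snippet (hypothetical, not Ignite code):

      import java.util.List;
      import java.util.Map;

      public class ResultingLogs {
          public static void main(String[] args) {
              Map<String, List<String>> logs = Map.of(
                      "A", List.of("1: initial cfg"),
                      "B", List.of("1: initial cfg"),
                      "C", List.of("1: initial cfg", "2: re-election"));

              // Entry 2 exists only on C and was never replicated to a majority, so it is uncommitted.
              // If C replays its whole log on start, entry 2 gets applied; if A or B then wins the
              // election, C is asked to truncate entry 2, which it has already applied: the peer is broken.
              logs.forEach((node, log) -> System.out.println(node + " - " + log));
          }
      }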

      How to fix

      option a. Recover the log after join. This is not optimal; it's like performing local recovery after cluster activation in Ignite 2, a behavior we fixed a long time ago.

      option b. Somehow track the committed index and perform a partial recovery that guarantees safety. We could write the committed index into the log storage periodically.

      "b" is better, but maybe there are other ways as well.

      Upd #1

      Most likely, we can simply remove all of that "await log replay" code on Raft node start, because it is no longer needed. It was originally introduced to enable primary replica direct storage reads, which is now covered properly within:

      /**
       * Tries to read the index from the group leader and waits for this index to appear in the local storage. Can possibly return a failed
       * future with a timeout exception; in this case, the replica will not answer the placement driver, because the response is useless.
       * The placement driver should handle this.
       *
       * @param expirationTime Lease expiration time.
       * @return Future that is completed when the local storage catches up to the index that is actual for the leader at the moment of the request.
       */
      private CompletableFuture<Void> waitForActualState(long expirationTime) {
          LOG.info("Waiting for actual storage state, group=" + groupId());
      
          long timeout = expirationTime - currentTimeMillis();
          if (timeout <= 0) {
              return failedFuture(new TimeoutException());
          }
      
          return retryOperationUntilSuccess(raftClient::readIndex, e -> currentTimeMillis() > expirationTime, executor)
                  .orTimeout(timeout, TimeUnit.MILLISECONDS)
                  .thenCompose(storageIndexTracker::waitFor);
      }

      Something similar applies to RO access: we await the safeTime that has a happens-before (HB) relation with the corresponding storage updates.
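
      For completeness, a rough sketch of that RO idea (hypothetical names, not the real Ignite API): before serving a read-only request at some timestamp, wait until the locally observed safeTime has reached it.

      import java.util.concurrent.CompletableFuture;
      import java.util.function.Supplier;

      public class RoReadSketch {
          interface SafeTimeTracker {
              // Completes when the locally observed safe time is >= the given timestamp.
              CompletableFuture<Void> waitFor(long timestamp);
          }

          // Waiting for safeTime >= readTimestamp gives the happens-before relation with
          // the storage updates that the read is allowed to observe.
          static <T> CompletableFuture<T> readOnly(SafeTimeTracker safeTime, long readTimestamp, Supplier<T> readFromLocalStorage) {
              return safeTime.waitFor(readTimestamp).thenApply(v -> readFromLocalStorage.get());
          }
      }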


People

  alapin Alexander Lapin
  ibessonov Ivan Bessonov
  Vladislav Pyatkov
