Details
Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
Wanted to capture a few thoughts about manually fixing broken configs or automatically rolling back bad config changes. This isn't a fully baked design, just wanted to jot down some initial thoughts.
A general way to (attempt to) abort uncommitted ops is to truncate the Raft log on the leader (and replace the op with a NO_OP or something similar).
Some thoughts on recovering from "bad" configs:
- We may hit a situation where there is an in-progress config change operation that will be impossible to commit due to a majority of the nodes in the "target" config being permanently dead. If the leader is still alive, we can provide a timeout on these ops or a way to explicitly (via RPC) abort them by truncating the log.
- If no leader is alive, and it's impossible to elect one, then we could write an "unsafe" tool, only for emergency use, that could do something evil like making a surviving follower think that the tool is the new leader and appending an unsafe change-config op to that follower's log.
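The two recovery paths above could look roughly like the following Python sketch. Everything here is hypothetical: `PendingConfigChange`, `should_abort`, `Entry`, and `unsafe_force_config` are names invented for illustration, and the choice to have the tool use a term one higher than the follower's current term is an assumption, not something this issue specifies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingConfigChange:
    index: int               # log index of the uncommitted change-config op
    target_peers: List[str]  # peers in the "target" config
    started_at: float        # seconds, e.g. from a monotonic clock


def should_abort(pending: PendingConfigChange, now: float,
                 timeout_s: float, force: bool = False) -> bool:
    """Leader side: abort on an explicit request (e.g. an RPC) or on timeout."""
    return force or (now - pending.started_at) >= timeout_s


@dataclass
class Entry:
    term: int
    index: int
    op_type: str
    peers: Optional[List[str]] = None


def unsafe_force_config(follower_log: List[Entry], follower_term: int,
                        new_peers: List[str]) -> Entry:
    """Emergency only: act as a fake leader and append a change-config op.

    Assumption: the tool claims a term one higher than the follower's
    current term so the follower accepts it as the new leader.
    """
    last_index = follower_log[-1].index if follower_log else 0
    entry = Entry(follower_term + 1, last_index + 1, "CHANGE_CONFIG", list(new_peers))
    follower_log.append(entry)
    return entry
```

On the leader, a `True` result from `should_abort` would trigger the log truncation described earlier. The unsafe path bypasses consensus entirely, so the sketch only shows the shape of the forced entry, not any safety checks an actual tool would want.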