[KUDU-1194] consensus: Allow abort of uncommittable config change ops - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: consensus
Labels: None

Description

This issue captures a few thoughts about manually fixing broken configs or automatically rolling back bad config changes. It is not a fully baked design, just some initial notes.

A general way to (attempt to) abort uncommitted ops is to truncate the Raft log on the leader (and replace the op with a NO_OP or something similar).
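To make the idea concrete, here is a minimal sketch of the truncate-and-replace step on the leader's local log. It is not Kudu's implementation: `LogEntry`, `LeaderLog`, and `abort_uncommitted_from` are hypothetical names invented for illustration, and the sketch models only the leader's in-memory log, not replication or how followers reconcile their own copies.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogEntry:
    term: int
    index: int
    op_type: str  # e.g. "WRITE", "CHANGE_CONFIG", "NO_OP" (illustrative labels)


@dataclass
class LeaderLog:
    """Hypothetical in-memory model of a leader's Raft log."""
    entries: List[LogEntry] = field(default_factory=list)
    committed_index: int = 0

    def abort_uncommitted_from(self, index: int, current_term: int) -> List[LogEntry]:
        """Drop every entry at or after `index` and append a NO_OP at `index`.

        Committed entries are never touched. Returns the dropped entries.
        """
        if index <= self.committed_index:
            raise ValueError("cannot abort an op that is already committed")
        aborted = [e for e in self.entries if e.index >= index]
        if not aborted:
            raise ValueError("no uncommitted op at or after that index")
        # Truncate the log, then put a NO_OP where the aborted op used to be.
        self.entries = [e for e in self.entries if e.index < index]
        self.entries.append(LogEntry(term=current_term, index=index, op_type="NO_OP"))
        return aborted
```

The committed-index guard is the important part of the sketch: truncation is only ever considered for ops that have not committed, which is why the abort remains an attempt rather than a guarantee.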

Some thoughts on recovering from "bad" configs:

We may hit a situation where there is an in-progress config change operation that will be impossible to commit due to a majority of the nodes in the "target" config being permanently dead. If the leader is still alive, we can provide a timeout on these ops or a way to explicitly (via RPC) abort them by truncating the log.
If no leader is alive, and it's impossible to elect one, then we could write an "unsafe" tool, for emergency use only, that does something evil such as making a follower think that the tool is the new leader and appending an unsafe change-config op to that follower's log.
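For the case where the leader is still alive, the two checks involved can be sketched roughly as follows: whether the target config can still reach a majority, and whether a pending op should be aborted because it timed out or an abort was requested explicitly. All names here (`majority_size`, `is_committable`, `should_abort`) are hypothetical and not part of Kudu; how liveness is determined and how the abort RPC would look are left open, as in the description above.

```python
from typing import Iterable, Set


def majority_size(num_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return num_voters // 2 + 1


def is_committable(target_voters: Set[str], live_nodes: Iterable[str]) -> bool:
    """True if enough voters of the target config are alive to commit the op."""
    live_voters = target_voters & set(live_nodes)
    return len(live_voters) >= majority_size(len(target_voters))


def should_abort(elapsed_s: float, timeout_s: float, abort_requested: bool = False) -> bool:
    """Abort a pending config change on explicit request or once it times out."""
    return abort_requested or elapsed_s >= timeout_s
```

With a three-voter target config and only one voter alive, `is_committable` returns `False`, which is the situation in which a timeout or an explicit abort would be the only way forward.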

Issue Links

blocks: KUDU-1097 Higher availability re-replication support (Resolved)

is duplicated by: KUDU-1668 Add support for aborting a config-change operation that cannot commit (Resolved)

People

Assignee: Mike Percy

Reporter: Mike Percy

Votes: 0

Watchers: 4

Dates

Created: 26/Sep/15 22:50

Updated: 25/Aug/17 21:26