[KUDU-1097] Higher availability re-replication support - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: Public beta
Fix Version/s: 1.7.0
Component/s: consensus
Labels: None
Target Version/s: 1.7.0

Description

Relative to the re-replication support outlined in KUDU-1096, we can achieve better availability. Here is a rough outline of such a design.

Design:

When a voter falls behind the leader's log GC threshold, the leader notifies the Master that the voter is no longer up to date.
The Master selects a node to act as a replacement. It adds that node to the config as a PRE_VOTER (see KUDU-869), and once that node has caught up, it is automatically promoted to VOTER.
When the Master detects that the node has been promoted, it removes the bad node from the config.
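
The three design steps above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: `RaftConfig`, `add_replacement`, `promote_if_caught_up`, and `evict_after_promotion` are hypothetical names invented for this sketch and are not Kudu APIs; the real logic lives in the consensus and master code.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    VOTER = "VOTER"
    PRE_VOTER = "PRE_VOTER"


@dataclass
class RaftConfig:
    # Hypothetical model: maps a server id to its membership type.
    members: dict = field(default_factory=dict)

    def voters(self):
        return sorted(s for s, r in self.members.items() if r is Role.VOTER)


def add_replacement(config, spare):
    """The Master adds the chosen spare node as a PRE_VOTER."""
    config.members[spare] = Role.PRE_VOTER


def promote_if_caught_up(config, server, caught_up):
    """A PRE_VOTER that has caught up is promoted to VOTER."""
    if caught_up and config.members.get(server) is Role.PRE_VOTER:
        config.members[server] = Role.VOTER
        return True
    return False


def evict_after_promotion(config, bad, replacement):
    """The lagging voter is removed only once the replacement is a VOTER."""
    if config.members.get(replacement) is Role.VOTER:
        config.members.pop(bad, None)
        return True
    return False


# Voter "C" has fallen behind the leader's log GC threshold; "D" is the spare.
config = RaftConfig({"A": Role.VOTER, "B": Role.VOTER, "C": Role.VOTER})
add_replacement(config, "D")
voters_during_copy = config.voters()  # "D" does not vote while catching up
promoted = promote_if_caught_up(config, "D", caught_up=True)
evicted = evict_after_promotion(config, "C", "D")
final_voters = config.voters()
```

The ordering is the point of the design: the lagging voter stays in the config until its replacement can vote, so the voter count never drops below its original size during the swap.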

Additional cases to detect and handle:

If the config is in a state where adding a node would be impossible, because a voter that has fallen behind the log GC threshold is part of the required majority, then remotely bootstrap that voter without changing the config. The tablet remains unable to serve writes during this time, but it self-heals without administrator intervention.

This can be further improved by adding support for aborting a config-change operation that cannot commit.

This requires some additional plumbing from the leader to the Master to notify it of slow followers.
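
The choice between adding a replacement and bootstrapping a lagging voter in place can be sketched as a majority check. This is a minimal sketch, assuming that a config change commits only when a majority of the current voters can acknowledge it; `majority` and `choose_recovery` are hypothetical helpers for illustration, not Kudu functions.

```python
def majority(num_voters):
    """Smallest number of voters that forms a majority."""
    return num_voters // 2 + 1


def choose_recovery(num_voters, num_up_to_date):
    """Pick a recovery path for a tablet with lagging voters.

    Assumption: only up-to-date voters can acknowledge a config change.
    """
    if num_up_to_date >= majority(num_voters):
        # The config change can commit, so add a PRE_VOTER replacement.
        return "ADD_PRE_VOTER"
    # The lagging voter is needed for the majority: bootstrap it in place.
    return "BOOTSTRAP_IN_PLACE"


one_lagging = choose_recovery(num_voters=3, num_up_to_date=2)
two_lagging = choose_recovery(num_voters=3, num_up_to_date=1)
```

With three voters, one lagging voter still leaves a majority of two, so a replacement can be added; with two lagging voters the config change could not commit, which is the case the in-place bootstrap is meant to cover.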

Pros:

Closer to optimal fault-tolerance properties; "majority lost" is less likely to occur, so administrator intervention is less likely to be needed.

Cons:

Requires support for pre-voter and a smarter master.

Issue Links

is blocked by
KUDU-869 Support PRE_VOTER config membership type (Resolved)
KUDU-1033 Capability to delete & bootstrap followers that fall too far behind log (Resolved)
KUDU-1194 consensus: Allow abort of uncommittable config change ops (Open)

is related to
KUDU-1096 Re-replication support for Kudu beta (Resolved)

relates to
KUDU-1449 tablet unavailable caused by follower can not upgrade to leader. (Resolved)

People

Assignee: Mike Percy
Reporter: Mike Percy
Votes: 0
Watchers: 7

Dates

Created: 01/Sep/15 18:56
Updated: 23/Mar/18 21:29
Resolved: 23/Mar/18 21:29