[KUDU-2950] Support restarting nodes in batches - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: ops-tooling
Labels:
None

Description

Once Kudu has the building blocks to orchestrate a rolling restart, it'd be great if we could support restarting multiple nodes at a time.

Location awareness would play a crucial role in this because, if used to identify racks placement, we could bring down an entire rack at a time if we wanted. If we did this, though, during the controlled restart of a given rack, Kudu would be more vulnerable to the unexpected downtime of another rack.

One approach would be to support something like HDFS's upgrade domains:

The idea is to group datanodes in a new dimension called upgrade domain, in addition to the existing rack-based grouping. For example, we can assign all datanodes in the first position of any rack to upgrade domain ud_01, nodes in the second position to upgrade domain ud_02 and so on.
...
By default, 3 replicas of any given block are placed on 3 different upgrade domains. This means all datanodes belonging to a specific upgrade domain collectively won’t store more than one replica of any block.

The decoupling of physical groups from restartable groups should make a batch restarts more robust to rack failures.

Attachments

Issue Links

depends upon

KUDU-2054 Rolling Restart and Upgrade

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Andrew Wong

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 23/Sep/19 23:19

Updated:: 03/Jun/20 15:44