[KUDU-2795] Prevent cascading failures by detecting that disks are full and rejecting attempts to add additional replicas to a tablet server - ASF JIRA

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 1.8.0
Fix Version/s: None
Component/s: master, tserver
Labels: None

Description

Over the weekend, a case was reported in which the tablet server disks were nearly full across a Kudu cluster. One tablet server eventually reached the tipping point and crashed because its WAL disk ran out of space and a write failed. This triggered a cascading failure: the replicas on that tablet server were re-replicated to the rest of the cluster nodes, pushing them beyond the tipping point as well, and eventually the whole cluster crashed.

We could potentially prevent the cascading failure by detecting that a tablet server is nearly full and rejecting or preventing attempts to move additional replicas to that server while it is in the "yellow zone" of disk space availability. The trade-off is deliberate: under-replicated tablets are preferable to an unavailable cluster.
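
A minimal sketch of the proposed check, to make the idea concrete. This is not Kudu code: the `TabletServerDisk` type, the `eligible_targets` function, and the 10% `YELLOW_ZONE_FREE_FRACTION` threshold are all hypothetical names and values chosen for illustration. It assumes the placement logic knows each server's disk capacity and free space, and it excludes any server that would be left inside the yellow zone after receiving the replica.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical threshold: a server with less than this fraction of its
# disk free is considered to be in the "yellow zone".
YELLOW_ZONE_FREE_FRACTION = 0.10


@dataclass
class TabletServerDisk:
    """Hypothetical view of one tablet server's disk usage."""
    name: str
    capacity_bytes: int
    free_bytes: int

    def in_yellow_zone(self, threshold: float = YELLOW_ZONE_FREE_FRACTION) -> bool:
        # True when free space has dropped below the reserved fraction.
        return self.free_bytes < threshold * self.capacity_bytes


def eligible_targets(
    servers: List[TabletServerDisk],
    replica_size_bytes: int,
    threshold: float = YELLOW_ZONE_FREE_FRACTION,
) -> List[TabletServerDisk]:
    """Return servers that would stay out of the yellow zone after
    receiving a replica of the given size."""
    targets = []
    for server in servers:
        free_after = server.free_bytes - replica_size_bytes
        if free_after >= threshold * server.capacity_bytes:
            targets.append(server)
    return targets
```

If `eligible_targets` returns an empty list, the caller would leave the tablet under-replicated rather than place the replica on a nearly full server, which is the behavior this issue argues for.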

Issue Links

relates to: KUDU-2404 Mitigate effects of full disks (Open)

People

Assignee: Unassigned

Reporter: Mike Percy

Votes: 0

Watchers: 2

Dates

Created: 23/Apr/19 17:31

Updated: 03/Jun/20 15:04