Description
Over the weekend, a case was reported in which the tablet server disks were nearly full across a Kudu cluster. One tablet server finally reached the tipping point and crashed when a write failed because its WAL disk was out of space. This triggered a cascading failure: the replicas on that tablet server were re-replicated to the remaining nodes, which pushed them past the tipping point as well, and eventually the whole cluster crashed.
We could potentially prevent this kind of cascading failure by detecting that a tablet server is nearly full and rejecting, or otherwise preventing, attempts to move additional replicas to that server while it is in the "yellow zone" of disk space availability. The trade-off is deliberate: under-replicated tablets are preferable to an unavailable cluster.
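To make the proposal concrete, here is a minimal sketch of the yellow-zone check in Python. It is not Kudu code: the names `DiskZone`, `ZoneThresholds`, `classify_disk`, `accepts_new_replica`, and `pick_replica_targets` are hypothetical, and the 10% and 2% thresholds are placeholder assumptions rather than proposed defaults.

```python
from dataclasses import dataclass
from enum import Enum


class DiskZone(Enum):
    GREEN = "green"    # plenty of space: accept new replicas
    YELLOW = "yellow"  # nearly full: keep serving, refuse new replicas
    RED = "red"        # effectively full: writes are at risk of failing


@dataclass(frozen=True)
class ZoneThresholds:
    # Hypothetical cutoffs, expressed as the fraction of the disk that is free.
    yellow_free_fraction: float = 0.10
    red_free_fraction: float = 0.02


def classify_disk(free_bytes, total_bytes, thresholds=ZoneThresholds()):
    """Map a disk's free space to a zone."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    if free_fraction <= thresholds.red_free_fraction:
        return DiskZone.RED
    if free_fraction <= thresholds.yellow_free_fraction:
        return DiskZone.YELLOW
    return DiskZone.GREEN


def accepts_new_replica(free_bytes, total_bytes, thresholds=ZoneThresholds()):
    """A server takes additional replicas only while it is in the green zone."""
    return classify_disk(free_bytes, total_bytes, thresholds) is DiskZone.GREEN


def pick_replica_targets(servers, needed):
    """Choose up to `needed` servers for re-replication.

    `servers` maps a server name to a (free_bytes, total_bytes) tuple.
    Fewer than `needed` names may be returned, in which case the tablet
    stays under-replicated instead of filling a nearly full server.
    """
    eligible = [
        name for name, (free, total) in servers.items()
        if accepts_new_replica(free, total)
    ]
    # Prefer the servers with the largest fraction of free space.
    eligible.sort(key=lambda n: servers[n][0] / servers[n][1], reverse=True)
    return eligible[:needed]
```

When every remaining server is in the yellow or red zone, `pick_replica_targets` returns an empty list and the tablet remains under-replicated, which is the behavior argued for above. In a real deployment, the free and total byte counts would come from the tablet server's WAL and data directories, for example via `shutil.disk_usage`.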
Issue Links
- relates to KUDU-2404 Mitigate effects of full disks (Open)