Kudu / KUDU-2795

Prevent cascading failures by detecting that disks are full and rejecting attempts to add additional replicas to a tablet server


Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.8.0
    • Fix Version/s: None
    • Component/s: master, tserver
    • Labels: None

    Description

      Over the weekend, a case was reported in which the tablet server disks were nearly full across a Kudu cluster. One tablet server finally reached the tipping point and crashed because its WAL disk ran out of space and a write failed. This caused a cascading failure: the replicas on that tablet server were re-replicated to the remaining cluster nodes, pushing them past the tipping point as well, and eventually the whole cluster crashed.

      We could potentially prevent the cascading failure by detecting that a tablet server is nearly full and rejecting or preventing attempts to move additional replicas to that server while it is in the "yellow zone" of disk space availability. In other words, we would prefer under-replicated tablets over an unavailable cluster.
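
      To make the proposal concrete, here is a minimal sketch of the "yellow zone" check, written in Python for brevity rather than in Kudu's own C++. Everything in it is an assumption for illustration: the `Zone` enum, the function names, and the 10% / 2% thresholds are hypothetical and do not correspond to any existing Kudu API or flag.

```python
import shutil
from enum import Enum


class Zone(Enum):
    GREEN = "green"    # plenty of space: accept new replicas
    YELLOW = "yellow"  # nearly full: keep serving, reject new replicas
    RED = "red"        # effectively full: writes are at risk


# Hypothetical thresholds, chosen only for illustration.
YELLOW_FREE_FRACTION = 0.10
RED_FREE_FRACTION = 0.02


def classify(free_bytes: int, total_bytes: int) -> Zone:
    """Map free/total bytes on one disk to a zone."""
    if total_bytes <= 0:
        return Zone.RED  # treat an unreadable or empty device as full
    free_fraction = free_bytes / total_bytes
    if free_fraction < RED_FREE_FRACTION:
        return Zone.RED
    if free_fraction < YELLOW_FREE_FRACTION:
        return Zone.YELLOW
    return Zone.GREEN


def disk_zone(path: str) -> Zone:
    """Classify the filesystem that contains `path` (e.g. a WAL or data dir)."""
    usage = shutil.disk_usage(path)
    return classify(usage.free, usage.total)


def should_accept_new_replica(zones) -> bool:
    """Accept a new replica only if every relevant disk is in the green zone."""
    return all(zone is Zone.GREEN for zone in zones)
```

      In this sketch a tablet server would evaluate `should_accept_new_replica` over the zones of its WAL and data directories before agreeing to host another replica. A server in the yellow zone keeps serving its existing replicas but turns new ones away, which leaves some tablets under-replicated instead of pushing more servers over the edge.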


            People

              Assignee: Unassigned
              Reporter: Mike Percy (mpercy)