  Kudu / KUDU-2904

Master shouldn't allow master tablet operations after a disk failure

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.11.0
    • Fix Version/s: None
    • Component/s: fs, master
    • Labels:

      Description

      The master doesn't register any FS error handlers. As a result, when a disk failure doesn't intrinsically crash the server (i.e. a failure of one of several directories), the master tablet is not marked as failed and may continue to run maintenance manager (MM) ops. This is forbidden: the invariant is that a tablet with a failed disk should itself fail. In the master, the response should perhaps be more severe (i.e. perhaps the master should crash itself).
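
      To make the invariant concrete, here is a minimal sketch of what registering a handler buys. Every name in it (ErrorManager, Tablet, TryRunMaintenanceOp, OpsRunAfterFailure, the "/data/1" path) is hypothetical and chosen for illustration; it is not Kudu's actual FS or tablet API, only a model of "disk failure reported, tablet fails, further MM ops refused".

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical model, not Kudu's real classes.
enum class TabletState { kRunning, kFailed };

class Tablet {
 public:
  void MarkFailed() { state_ = TabletState::kFailed; }
  bool failed() const { return state_ == TabletState::kFailed; }

  // Stands in for scheduling a maintenance op such as a minor delta
  // compaction. Refused once the tablet has been marked failed.
  bool TryRunMaintenanceOp() {
    if (failed()) return false;
    ++ops_run_;
    return true;
  }
  int ops_run() const { return ops_run_; }

 private:
  TabletState state_ = TabletState::kRunning;
  int ops_run_ = 0;
};

class ErrorManager {
 public:
  using Handler = std::function<void(const std::string& dir)>;

  void SetHandler(Handler h) { handler_ = std::move(h); }

  // Called by the (modeled) FS layer when a directory fails.
  // Returns false if nobody registered a handler, which is the
  // situation described in this issue.
  bool ReportDiskFailure(const std::string& dir) const {
    assert(!dir.empty());
    if (!handler_) return false;
    handler_(dir);
    return true;
  }

 private:
  Handler handler_;
};

// Counts how many maintenance ops still run after a disk failure
// has been reported, with or without a registered handler.
int OpsRunAfterFailure(bool register_handler) {
  ErrorManager errors;
  Tablet tablet;
  if (register_handler) {
    errors.SetHandler([&tablet](const std::string&) { tablet.MarkFailed(); });
  }
  tablet.TryRunMaintenanceOp();         // healthy op before the failure
  errors.ReportDiskFailure("/data/1");  // hypothetical directory
  const int before = tablet.ops_run();
  tablet.TryRunMaintenanceOp();
  tablet.TryRunMaintenanceOp();
  return tablet.ops_run() - before;
}
```

      In this model, OpsRunAfterFailure(true) reports that no further ops ran, while OpsRunAfterFailure(false) shows both ops still running against a tablet whose disk has failed. A stricter handler for the master could terminate the process instead of calling MarkFailed().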

      This surfaced in a user report: a master ran multiple minor delta compactions even after one of them had failed during a SyncDir() call in its superblock flush. The resulting metadata was corrupt: the blocks that the compaction added to the superblock were marked as deleted in the log block manager (LBM). It's unclear whether the failure and the subsequent compactions corrupted the in-memory state of the superblock, or whether something else caused the corruption. Either way, no operations should have been permitted after the initial failure.
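
      The sequence in the report can be modeled with a sticky failure flag. The sketch below is an assumption about one possible guard, not a description of how Kudu's superblock code works: SuperblockModel, CommitCompaction, and the sync callback (standing in for the flush that failed in SyncDir()) are all hypothetical. It stages the new block list, updates in-memory state only after the flush succeeds, and refuses every later commit once a flush has failed.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical model, not Kudu's real metadata classes.
class SuperblockModel {
 public:
  // 'sync' stands in for the durable superblock flush; it returns
  // false to simulate a disk failure.
  explicit SuperblockModel(std::function<bool()> sync)
      : sync_(std::move(sync)) {
    assert(static_cast<bool>(sync_));
  }

  // Records the blocks produced by a compaction. Returns false, leaving
  // the in-memory block list untouched, if the tablet has already failed
  // or if this flush fails.
  bool CommitCompaction(const std::vector<int>& new_blocks) {
    if (failed_.load()) return false;

    // Stage the change on a copy so a failed flush cannot leave the
    // in-memory list ahead of what is on disk.
    std::vector<int> staged = blocks_;
    staged.insert(staged.end(), new_blocks.begin(), new_blocks.end());

    if (!sync_()) {
      failed_.store(true);  // sticky: never cleared in this model
      return false;
    }
    blocks_ = std::move(staged);
    return true;
  }

  bool failed() const { return failed_.load(); }
  const std::vector<int>& blocks() const { return blocks_; }

 private:
  std::function<bool()> sync_;
  std::atomic<bool> failed_{false};
  std::vector<int> blocks_;
};
```

      With a sync callback that fails on its second call, the first compaction commits, the second fails and sets the flag, and a third is rejected without attempting another flush. Whether the real fix should latch a failed state like this or crash the master outright is the open question in the description.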

            People

            • Assignee: Unassigned
            • Reporter: Adar Dembo (adar)
            • Votes: 0
            • Watchers: 1
