  1. Kudu
  2. KUDU-2904

Master shouldn't allow master tablet operations after a disk failure



    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.11.0
    • Fix Version/s: 1.12.0
    • Component/s: fs, master
    • Labels:


      The master doesn't register any FS error handlers. As a result, when a disk failure doesn't intrinsically crash the server (e.g. a failure of one of several directories), the master tablet is not marked as failed and may undergo additional maintenance manager (MM) ops. This is forbidden: the invariant is that a tablet with a failed disk should itself fail. In the master, the response should perhaps be more severe (e.g. the master could crash itself).

      This surfaced in a user report: a master ran multiple minor delta compactions even after one of them had failed during a SyncDir() call on its superblock flush. The metadata was corrupt: the blocks added to the superblock by the compaction were marked as deleted in the log block manager (LBM). It's unclear whether the in-memory state of the superblock was corrupted by the failure and subsequent compactions, or whether the corruption was caused by something else. Either way, no operations should have been permitted following the initial failure.
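
The invariant described above can be illustrated with a minimal C++ sketch. It is not Kudu's actual code: `DiskErrorNotifier`, `Tablet`, and all of their methods are hypothetical names chosen for illustration. The idea is only that a registered handler marks the tablet as failed on the first disk error, and every later maintenance op checks that flag before running.

```cpp
#include <atomic>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a filesystem error notifier (not Kudu's API).
class DiskErrorNotifier {
 public:
  using Callback = std::function<void(const std::string& dir)>;

  // Components register a handler to learn about disk failures.
  void RegisterHandler(Callback cb) { handlers_.push_back(std::move(cb)); }

  // Called by the I/O layer when an operation on 'dir' fails.
  void NotifyDiskFailure(const std::string& dir) const {
    for (const auto& cb : handlers_) cb(dir);
  }

 private:
  std::vector<Callback> handlers_;
};

// Hypothetical tablet that rejects maintenance ops once it is marked failed.
class Tablet {
 public:
  void MarkFailed() { failed_.store(true); }
  bool failed() const { return failed_.load(); }

  // Returns true if the compaction ran, false if it was rejected
  // because the tablet had already failed.
  bool RunMinorDeltaCompaction() {
    if (failed()) return false;
    ++compactions_run_;
    return true;
  }

  int compactions_run() const { return compactions_run_; }

 private:
  std::atomic<bool> failed_{false};
  int compactions_run_ = 0;
};
```

To wire this up, the owner of the tablet would call `notifier.RegisterHandler([&tablet](const std::string&) { tablet.MarkFailed(); })` at startup. A stricter handler for the master case could call `std::abort()` instead of marking the tablet failed, which corresponds to the "crash itself" option mentioned in the description.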




            • Assignee:
              Bankim Bhavsar
            • Reporter:
              Adar Dembo
            • Votes:
              0
            • Watchers:
              2


              • Created: