Kudu / KUDU-1840

Tolerate disk failures on single tablet servers

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: n/a
    • Component/s: fs
    • Labels:
      None

      Description

      The way we store data on disk is akin to striping (RAID 0): when one disk is lost, the data remaining on the other disks is not enough to recover from.
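
To make the RAID 0 comparison concrete, here is a toy model. It is not Kudu's actual placement logic: the round-robin policy and the function names `place_blocks` and `tablets_hit_by_failure` are assumptions made purely for illustration. It counts how many tablets lose at least one block when a single disk fails.

```python
def place_blocks(num_tablets, blocks_per_tablet, num_disks):
    """Return {tablet_id: set of disk indexes holding its blocks}.

    Hypothetical round-robin placement, used only to illustrate striping.
    """
    placement = {}
    next_disk = 0
    for tablet in range(num_tablets):
        disks = set()
        for _ in range(blocks_per_tablet):
            disks.add(next_disk)
            next_disk = (next_disk + 1) % num_disks
        placement[tablet] = disks
    return placement


def tablets_hit_by_failure(placement, failed_disk):
    """Return the tablets that have at least one block on the failed disk."""
    return sorted(t for t, disks in placement.items() if failed_disk in disks)


if __name__ == "__main__":
    placement = place_blocks(num_tablets=10, blocks_per_tablet=8, num_disks=4)
    hit = tablets_hit_by_failure(placement, failed_disk=3)
    print(f"{len(hit)} of {len(placement)} tablets lost data")
```

In this model, once each tablet has at least as many blocks as there are disks, every tablet touches every disk, so any single disk failure affects all of them.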

      After replacing a bad disk, users would see something like this:

      Jan 18, 10:20:55.693 AM  INFO  server_base.cc:179
      Could not load existing FS layout: Not found: /data/4/kudu/instance: No such file or directory (error 2)
      Jan 18, 10:20:55.693 AM  INFO  server_base.cc:180  
      Creating new FS layout
      Jan 18, 10:20:55.693 AM  FATAL  tablet_server_main.cc:64  
      Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/1/kudu-wal
      

      The log above shows a tablet server finding that one folder is empty and trying to create a new FS layout, then discovering that the other folders still contain data, at which point it crashes. Currently the workaround is to manually delete the data in all the remaining Kudu folders.
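
The mixed state described above (one folder empty, the others still populated) can be detected before starting the server. The following is a minimal sketch of such a pre-start check, not Kudu's own code: `classify_root` and `check_roots` are hypothetical helpers, and the only detail taken from the log is the `instance` file name.

```python
import os
import tempfile

INSTANCE_FILE = "instance"  # file name taken from the log above


def classify_root(path):
    """Return 'missing', 'empty', 'initialized', or 'dirty' for one root."""
    if not os.path.isdir(path):
        return "missing"
    entries = os.listdir(path)
    if not entries:
        return "empty"
    if INSTANCE_FILE in entries:
        return "initialized"
    return "dirty"  # has content but no instance file


def check_roots(paths):
    """Classify every root and report whether they agree with each other.

    Roots are treated as consistent when all of them are initialized, or
    when none of them holds any data yet (all empty or missing).
    """
    states = {p: classify_root(p) for p in paths}
    kinds = set(states.values())
    consistent = kinds <= {"initialized"} or kinds <= {"empty", "missing"}
    return states, consistent


if __name__ == "__main__":
    # Reproduce the situation from the report with temporary directories.
    with tempfile.TemporaryDirectory() as base:
        old_disk = os.path.join(base, "data1")
        new_disk = os.path.join(base, "data4")
        os.mkdir(old_disk)
        os.mkdir(new_disk)
        open(os.path.join(old_disk, INSTANCE_FILE), "w").close()

        states, consistent = check_roots([old_disk, new_disk])
        for path, state in states.items():
            print(f"{state:12} {path}")
        print("consistent" if consistent else "mixed state: replaced disk?")
```

A check like this only reports the problem. It does not make the surviving data usable, which is what this issue asks for.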

      As we fix this, one thing to keep in mind is that WALs can only be stored on one disk, so tolerating data disk failures would still not help if the WAL disk dies.

              People

              • Assignee: Unassigned
              • Reporter: Jean-Daniel Cryans (jdcryans)
              • Votes: 0
              • Watchers: 2
