Description
The way we store data on disk is akin to striping (RAID 0): losing one disk means that the data remaining on the other disks isn't recoverable.
After replacing a bad disk, users would see something like:
Jan 18, 10:20:55.693 AM INFO server_base.cc:179 Could not load existing FS layout: Not found: /data/4/kudu/instance: No such file or directory (error 2)
Jan 18, 10:20:55.693 AM INFO server_base.cc:180 Creating new FS layout
Jan 18, 10:20:55.693 AM FATAL tablet_server_main.cc:64 Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/1/kudu-wal
The above shows a tablet server finding that one directory has no instance file, attempting to create a new FS layout, and then crashing because the other directories still contain data. Currently the workaround is to manually delete the data in all the remaining Kudu folders.
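The startup behavior implied by the log can be sketched roughly as follows. This is an illustrative Python model, not Kudu's actual C++ code: the function name `check_fs_layout`, the return values, and the exact order of checks are assumptions made for the example. Only the `instance` file name and the error wording are taken from the log.

```python
import os

INSTANCE_FILE = "instance"  # file name as it appears in the log; everything else is hypothetical


def check_fs_layout(roots):
    """Model of the startup check: open an existing layout or create a new one.

    Returns "opened" if every root has an instance file, "created" if a new
    layout was written, and raises RuntimeError if a new layout is needed
    but some root still holds data.
    """
    missing = [r for r in roots
               if not os.path.isfile(os.path.join(r, INSTANCE_FILE))]
    if not missing:
        return "opened"

    # A root without an instance file is treated as "no existing layout",
    # so the server falls through to creating a fresh one across all roots.
    for root in roots:
        if os.path.isdir(root) and os.listdir(root):
            # The surviving disks still hold data, so creation is refused.
            raise RuntimeError(
                "Could not create new FS layout: "
                f"FSManager root is not empty: {root}")

    for root in roots:
        os.makedirs(root, exist_ok=True)
        with open(os.path.join(root, INSTANCE_FILE), "w"):
            pass  # an empty marker file stands in for the real instance metadata
    return "created"
```

In this model a single replaced (empty) disk is enough to make the whole call fail, which is why the only way forward today is to empty every remaining directory.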
As we fix this, one thing to keep in mind is that WALs can only be stored on one disk, so even if we tolerate data disk failures, that would still not help if the WAL disk dies.
Issue Links
- relates to KUDU-616 Mitigate tablet damage when disks are lost (Resolved)