[KUDU-1840] Tolerate disk failures on single tablet servers - ASF JIRA


Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: n/a
Component/s: fs
Labels: None

Description

The way we store data on disk is akin to striping or RAID 0: losing one disk means that the data on the remaining disks isn't recoverable either.

After replacing a bad disk, users would see something like this:

Jan 18, 10:20:55.693 AM  INFO  server_base.cc:179
Could not load existing FS layout: Not found: /data/4/kudu/instance: No such file or directory (error 2)
Jan 18, 10:20:55.693 AM  INFO  server_base.cc:180  
Creating new FS layout
Jan 18, 10:20:55.693 AM  FATAL  tablet_server_main.cc:64  
Check failed: _s.ok() Bad status: Already present: Could not create new FS layout: FSManager root is not empty: /data/1/kudu-wal

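The log above shows a tablet server finding that one directory is empty (its instance file is missing) and then attempting to create a new FS layout. Because the other directories still contain data, that attempt fails and the server crashes. Currently, the workaround is to manually delete the data in all the remaining Kudu directories.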
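The startup behavior described above can be modeled with a small sketch. This is not Kudu's code (Kudu is written in C++, and its real checks are more involved); it is a simplified Python model that only assumes what the log shows: each root is expected to hold an `instance` file, a missing one triggers an attempt to create a new layout, and that attempt refuses to proceed if any root is not empty. The function name `open_or_create_layout` and the return values are hypothetical.

```python
import os

# File name taken from the path in the log; everything else is illustrative.
INSTANCE_FILE = "instance"


def open_or_create_layout(roots):
    """Model of the startup decision: open an existing layout or create one.

    Returns "opened" if every root has an instance file, "created" if a
    fresh layout was written, and raises if the layout is only partly there.
    """
    missing = [
        r for r in roots
        if not os.path.isfile(os.path.join(r, INSTANCE_FILE))
    ]
    if not missing:
        return "opened"

    # At least one root has no instance file, so fall back to creating a
    # new layout. That is only allowed when every root is empty.
    for root in roots:
        if os.path.isdir(root) and os.listdir(root):
            raise RuntimeError(f"FSManager root is not empty: {root}")

    for root in roots:
        os.makedirs(root, exist_ok=True)
        with open(os.path.join(root, INSTANCE_FILE), "w") as f:
            f.write("placeholder")
    return "created"
```

In this model, replacing one disk leaves one root empty while the others still hold data, so the call fails on the first non-empty root, which mirrors the FATAL line in the log. Tolerating a single disk failure would mean handling that mixed state instead of treating it as an error.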

As we fix this, one thing to keep in mind is that WALs can only be stored on one disk, so even if we tolerate data disk failures, that would not help if the WAL disk dies.

Issue Links

relates to

KUDU-616 Mitigate tablet damage when disks are lost

Resolved

People

Assignee: Unassigned

Reporter: Jean-Daniel Cryans

Votes: 0

Watchers: 2

Dates

Created: 18/Jan/17 19:29

Updated: 19/Jan/17 17:14

Resolved: 19/Jan/17 17:14