From time to time we get user reports of durability issues in Kudu. We try to be good citizens and obey the POSIX spec with respect to durably storing data on disk, but we lack any tests that prove we're doing this correctly.
Ideally, we'd have a framework that lets us run a standard Kudu workload while doing pathological things to a subset of nodes, such as:
- Panicking the Linux kernel.
- Abruptly cutting power.
- Abruptly unmounting a filesystem or yanking a disk.
Then we'd restart Kudu on the affected nodes and prove that all on-disk data remains consistent.
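To make the "prove" step concrete, here is a minimal sketch of the check such a framework might run. It is not Kudu code: durable_append, fsync_dir, and missing_acked are hypothetical helpers, and the newline-delimited log file is an assumption made only for illustration. The idea is that a writer acknowledges a record only after fsync() returns, and that after the crash and restart every acknowledged record must still be readable.

```python
import os


def durable_append(path, record):
    """Append one record and return only after fsync() succeeds.

    Hypothetical helper: the caller treats the record as acknowledged
    once this function returns.
    """
    data = record + b"\n"
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        # os.write may write fewer bytes than requested, so loop.
        while data:
            written = os.write(fd, data)
            data = data[written:]
        os.fsync(fd)
    finally:
        os.close(fd)


def fsync_dir(dirpath):
    """fsync a directory so a newly created file's entry is durable.

    Works on Linux; other platforms may not allow fsync on a directory.
    """
    fd = os.open(dirpath, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def missing_acked(path, acked):
    """Return the acknowledged records that are absent after a restart."""
    try:
        with open(path, "rb") as f:
            present = set(f.read().splitlines())
    except FileNotFoundError:
        present = set()
    return [r for r in acked if r not in present]
```

In a real test the list of acknowledged records would have to be kept on a machine that is not being crashed, and an empty result from missing_acked after the restart would be the pass condition. This only covers lost acknowledged writes; detecting torn or corrupted data would need additional checks such as per-record checksums.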