Since the very early days of Kudu, we've used the file system as a sort of key-value store for various bits of metadata. In particular, each tablet has a "tablet metadata" file and a "consensus metadata" (cmeta) file. We also have various per-disk metadata, as well as LBM (log block manager) block metadata, but I'll leave those out of the present discussion.
Although the file system "works" and got us pretty far, there are various benefits we could gain by switching to something closer to a KV-store:
- write performance: writing a single file per "record", fsyncing it, and renaming it to replace the old version is pretty slow. Empirically, file systems do a poor job of group-committing many such operations when they happen at the same time, especially as the disks get busy.
- read performance: on startup, we need to read all of these metadata files, and using separate files means they are likely to sit on entirely separate areas of the disk. If we have, for example, 100k tablets in the future, we'll pay 100k random reads, which could take 10+ minutes on a cold hard disk (at roughly 10 ms per random read, 100k reads is about 1,000 seconds).
- scalability: if we want to get to 100k+ tablets per server, we start entering the territory where file systems are unhappy with that many files.
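To make the write-performance point concrete, here is a minimal sketch of the write/fsync/rename sequence from the first bullet. It is illustrative Python, not Kudu's actual C++ code: the function name `write_record_atomically` is hypothetical, and the directory fsync at the end is an assumption about what a durable rename needs on a POSIX file system.

```python
import os
import tempfile


def write_record_atomically(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` via temp file, fsync, rename.

    Hypothetical helper for illustration only; assumes a POSIX file system.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one
    # file system and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # first fsync: the file contents
        os.replace(tmp_path, path)  # atomically swap in the new version
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Second fsync: the directory entry, so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Under these assumptions every record update costs two fsyncs and a rename of its own, and nothing in this pattern lets concurrent updates to different records share that cost.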
Solving the performance problem also gives us a lot of headroom to do things such as replicating the cmeta information for each tablet onto two or three disks, which has a lot of benefits for handling disk failure (for example, you can lose a disk without getting "amnesia" about a Raft vote).
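As a rough illustration of the replication idea, the sketch below writes the same record to several directories (standing in for disks) and, on read, takes the newest copy that is still readable. Everything here is an assumption made for the example: the JSON encoding, the `seq` version counter, the `term` and `voted_for` field names, and the function names are hypothetical and are not Kudu's cmeta format or API.

```python
import json
import os
from typing import Optional, Sequence


def write_replicas(dirs: Sequence[str], name: str, record: dict) -> int:
    """Write `record` into every directory; return how many copies succeeded."""
    payload = json.dumps(record).encode()
    ok = 0
    for d in dirs:
        try:
            tmp = os.path.join(d, name + ".tmp")
            with open(tmp, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, os.path.join(d, name))
            # A real implementation would also fsync the directory here.
            ok += 1
        except OSError:
            continue  # treat this directory as a failed disk
    return ok


def read_newest(dirs: Sequence[str], name: str) -> Optional[dict]:
    """Return the readable copy with the highest `seq`, or None if none exist."""
    best = None
    for d in dirs:
        try:
            with open(os.path.join(d, name), "rb") as f:
                rec = json.loads(f.read())
        except (OSError, ValueError):
            continue  # missing, unreadable, or corrupt copy
        if best is None or rec["seq"] > best["seq"]:
            best = rec
    return best
```

A caller would have to decide how many successful copies count as a durable write before acting on the record (for example, before replying to a vote request). That policy is deliberately left out of the sketch.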