I was investigating Kudu's disk space consumption on an internal cluster and found a few interesting things. This is a 42-node cluster with three masters, running CentOS 6.6. I focused on a particular node with 11 data directories, each formatted with XFS. The node was serving a bunch of tombstoned tablets, but no actual live tablets. All the tablets belonged to one of two tables. Due to the file sizes involved, the following analysis was done on just one of the data directories.
There were 7406 "live" blocks. I put live in quotes because these blocks were orphaned by definition, as there were no live tablets. The running theory is that they were orphaned due to tablet copy operations that failed mid-way.
KUDU-1853 tracks this issue, at least with respect to non-crash failures. Failures due to a crash require a data GC of some kind, tracked in KUDU-829. The live blocks were stored in 1025 LBM containers. The vast majority of the file space in each container held punched-out dead blocks, as one might expect. Taken together, the live blocks accounted for ~85 GB of data.
However, the total disk space usage of these container files was ~123 GB. There were three discrepancies here, one tiny, one minor, and one major:
- There was ~17 MB of space lost to external fragmentation. This is because LBM containers force live blocks to be aligned to the nearest filesystem block.
- There was ~1.4 GB of dead block data that was backed by live extents according to filefrag. That is, these are dead blocks the tserver either failed to punch, or (more likely) crashed before it could punch.
- There was ~40 GB of zeroed space hanging off the edge of the container files. Unlike a typical preallocation, this space was not accounted for in the logical file size; it only shows up in filefrag or du. I believe this is due to XFS's speculative preallocation feature. What is worrying is that this preallocation became permanent; neither clearing the kernel's inode cache nor shutting down the tserver (the documented workarounds) made it disappear. Only an explicit ftruncate() cleared it up.
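The three categories above can be illustrated with a small accounting sketch. This is not the attached check_fragmentation.py; the function names, the (offset, length) record shape, and the 4096-byte filesystem block size are all assumptions made for illustration. It also assumes extents do not overlap one another and that the extent list has already been parsed from something like filefrag output.

```python
def round_up(n, multiple):
    """Round n up to the next multiple (n itself if already aligned)."""
    return -(-n // multiple) * multiple


def overlap(a_start, a_end, b_start, b_end):
    """Bytes shared by the half-open ranges [a_start, a_end) and [b_start, b_end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def account_space(live, dead, extents, fs_block_size=4096):
    """Classify the allocated space of one container (hypothetical helper).

    live, dead: lists of (offset, length) block records.
    extents:    list of (offset, length) physically allocated ranges.
    """
    # Bytes of actual live block data.
    live_bytes = sum(length for _, length in live)

    # Padding from the end of each live block to the next fs block boundary.
    padding = sum(
        round_up(off + length, fs_block_size) - (off + length)
        for off, length in live
    )

    # Dead block bytes that are still backed by an allocated extent.
    unpunched = sum(
        overlap(d_off, d_off + d_len, e_off, e_off + e_len)
        for d_off, d_len in dead
        for e_off, e_len in extents
    )

    # Allocated bytes past the aligned end of the last block, live or dead.
    last_end = max(
        (round_up(off + length, fs_block_size) for off, length in live + dead),
        default=0,
    )
    tail = sum(
        max(0, (e_off + e_len) - max(e_off, last_end))
        for e_off, e_len in extents
    )

    return {
        "live": live_bytes,
        "alignment_padding": padding,
        "unpunched_dead": unpunched,
        "tail_excess": tail,
    }
```

In this sketch the four numbers sum to the total allocated extent length whenever every live block is fully backed by extents, which makes a convenient sanity check on the input data.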
There are a couple of conclusions to draw here:
- It's good that we've fixed KUDU-1853; that should reduce the number of orphaned blocks. However, we should prioritize KUDU-829 too, as a crash during a tablet copy can still orphan a ton of blocks, far more than a crash during a flush or compaction.
- There's also a need to re-effect hole punches in case we crash after blocks have been deleted but before the punches take place. This can be done blindly on all dead blocks in an LBM container at startup, perhaps based on some "actual disk space used > expected disk space used" threshold. Or we can use the FIEMAP ioctl to figure out exactly where the extents are, and surgically only punch those that are needed.
- On XFS, we really need to address this speculative preallocation problem. It's not clear exactly what causes this temporary phenomenon to become permanent; the XFS FAQ is vague on that. One option, though, is to adjust the LBM truncate-full-container-file-at-startup logic to ignore the container's logical file size; that is, to always truncate the container to the end of the last block.
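The last two ideas can be sketched roughly as follows. This is only an illustration in Python, not Kudu's LBM code (which is C++): ranges_to_punch and truncate_to_last_block are hypothetical helpers, the (offset, length) record shape is assumed, and rounding the truncation point up to a 4096-byte filesystem block is my assumption about where "the end of the last block" falls. The extent list would come from FIEMAP or filefrag. Issuing the actual hole punch (fallocate() with FALLOC_FL_PUNCH_HOLE) is not shown.

```python
import os


def round_up(n, multiple):
    """Round n up to the next multiple (n itself if already aligned)."""
    return -(-n // multiple) * multiple


def ranges_to_punch(dead, extents):
    """Return (offset, length) pieces of dead blocks that are still allocated.

    dead:    list of (offset, length) dead block records.
    extents: list of (offset, length) allocated ranges, assumed non-overlapping.
    """
    punches = []
    for d_off, d_len in dead:
        for e_off, e_len in extents:
            start = max(d_off, e_off)
            end = min(d_off + d_len, e_off + e_len)
            if end > start:
                # Only the intersection needs a punch; the rest is already a hole.
                punches.append((start, end - start))
    return punches


def truncate_to_last_block(path, blocks, fs_block_size=4096):
    """Truncate a container to the aligned end of its last block.

    blocks: list of (offset, length) for every block record, live or dead.
    The file's current logical size is deliberately not consulted.
    """
    target = max(
        (round_up(off + length, fs_block_size) for off, length in blocks),
        default=0,
    )
    fd = os.open(path, os.O_RDWR)
    try:
        os.ftruncate(fd, target)
    finally:
        os.close(fd)
    return target
```

Calling ftruncate() unconditionally mirrors the observation above that only an explicit ftruncate() released the excess space; whether a truncate to the file's existing logical size is always enough on XFS is something I have not verified beyond that.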
I've attached two scripts that helped me during the analysis. dump_all_blocks.py converts the on-disk LBM metadata files into a JSON representation. check_fragmentation.py uses the JSON representation and the output of filefrag to find fragmentation, unpunched holes, and excess preallocation. frag_report is the output of "check_fragmentation.py -v" on the JSON representation of one of the data directories.
Let's use this JIRA to track the third item in the list above: the XFS speculative preallocation problem.