The fix for KUDU-1549 added support for deleting full log block manager containers with no live blocks, and for compacting container metadata to omit CREATE/DELETE record pairs. Both of these will help reduce the amount of metadata that must be read at startup. However, there's more we can do to help; this JIRA captures some additional ideas worth exploring (if/when LBM startup once again becomes intolerable):
In this gerrit, Todd made the case that container metadata processing is seek-dominant:
looking at a data/ dir on a cluster that has been around for quite some time, most of the metadata files seem to be around 400KB. Assuming 100MB/sec sequential throughput and 10ms seek, it definitely seems like the startup time would be seek-dominated (10 or 20ms seek depending whether various internal metadata pages are hot in cache, plus only 4ms of sequential read time).
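The back-of-the-envelope estimate above can be reproduced with a short calculation. This is only a sketch of the arithmetic using the figures quoted (400KB file, 100MB/sec throughput, 10ms seek, one or two seeks per file); the function name and the use of decimal units are assumptions made for illustration, not anything taken from Kudu.

```python
def metadata_read_cost_ms(file_size_bytes, throughput_bytes_per_sec,
                          seek_ms, num_seeks):
    """Return (total seek time, sequential transfer time) in milliseconds
    for reading one container metadata file. Hypothetical helper."""
    seek_total_ms = seek_ms * num_seeks
    transfer_ms = file_size_bytes * 1000.0 / throughput_bytes_per_sec
    return seek_total_ms, transfer_ms


# Figures from the quote, assuming decimal units (1KB = 1000 bytes).
FILE_SIZE = 400 * 1000           # ~400KB metadata file
THROUGHPUT = 100 * 1000 * 1000   # 100MB/sec sequential read

for seeks in (1, 2):  # 1 seek if internal metadata pages are hot, else 2
    seek, transfer = metadata_read_cost_ms(FILE_SIZE, THROUGHPUT, 10.0, seeks)
    share = seek / (seek + transfer)
    print(f"{seeks} seek(s): seek={seek:.0f}ms transfer={transfer:.0f}ms "
          f"seek share={share:.0%}")
```

Under these assumptions seeking accounts for roughly 71% to 83% of the per-file read time, which is why the ideas below focus on reading fewer files rather than fewer bytes.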
We theorized several ways to reduce seeking, all focused on reducing the number of discrete container metadata files read at startup:
- Raise the container max data file size. This won't help on older versions of el6 with ext4, but will help everywhere else. It makes sense for the max data file size to be a function of the disk size anyway. And it's a pretty cheap way to extract more scalability.
- Reuse container data file holes, explicitly to avoid creating so many containers. Perhaps with a round of "defragmentation" to simplify reuse, or perhaps not. As a side effect, metadata file compaction now becomes more important (and costly).
- Eschew one metadata file per data file altogether and maintain just one metadata file. Deleting "dead" containers would no longer be an improvement for metadata startup cost. Metadata compaction would be a lot more expensive. Block records themselves would be larger, because each record now needs to point to a particular data file, though this can be mitigated in various ways. A variant of this would be to do away with the 1-1 relationship between metadata and data files and make it more like m-n.
- Reduce the number of extents in container metadata files via judicious preallocation.
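To illustrate why fewer, larger containers help, here is a rough model of the first idea. It assumes every container is filled to its maximum data file size and that each metadata file costs a fixed number of seeks at startup; the function names, the per-disk data volume, and the container sizes are all hypothetical values chosen for illustration, not Kudu defaults or measurements.

```python
def containers_needed(total_data_bytes, max_container_bytes):
    """Number of containers required if each is filled to its max size
    (integer ceiling division)."""
    return (total_data_bytes + max_container_bytes - 1) // max_container_bytes


def startup_seek_ms(total_data_bytes, max_container_bytes,
                    seeks_per_file=2, seek_ms=10.0):
    """Estimated seek time to open every container metadata file once."""
    files = containers_needed(total_data_bytes, max_container_bytes)
    return files * seeks_per_file * seek_ms


GIB = 1024 ** 3
disk_data = 4096 * GIB  # hypothetical: 4 TiB of block data on one disk

for max_size_gib in (1, 10, 100):  # hypothetical max data file sizes
    n = containers_needed(disk_data, max_size_gib * GIB)
    ms = startup_seek_ms(disk_data, max_size_gib * GIB)
    print(f"max size {max_size_gib:>3} GiB: {n} containers, ~{ms:.0f}ms seeking")
```

In this model the seek cost falls in proportion to the container count. It ignores that larger containers also have larger metadata files, so the sequential read portion grows as the seek portion shrinks.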
See the gerrit linked above for more details.