nodetool garbagecollect always outputs to L0 with LeveledCompactionStrategy.
This is awful. On a large LCS table, this means that at the end of the garbagecollect process, all data is in L0.
This results in an awful sequence of useless temporary space usage and write amplification:
- L0 is repeatedly size-tiered compacted until it no longer has too many SSTables. If the original LCS table had 2000 SSTables, this takes a long time
- L0 is compacted to L1 in one or a few very large compactions
- L1 is compacted to L2, L2 to L3, and so on. Write amplification galore
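To put a rough number on the sequence above, here is a toy count of bytes rewritten. It is only a sketch under stated assumptions: the level sizes are hypothetical, the number of size-tiered passes in L0 is a guess, and extra rewrites caused by overlapping SSTables in the target level are ignored.

```python
def bytes_rewritten_via_l0(level_sizes_gb, stcs_passes=1):
    """Rough count of GB written when garbagecollect dumps everything into L0.

    level_sizes_gb[k] is the (hypothetical) amount of data that finally
    settles at level k; index 0 is unused and should be 0.
    """
    total = sum(level_sizes_gb)
    gc_pass = total                      # garbagecollect rewrites every SSTable once
    stcs = stcs_passes * total           # assumed size-tiered passes inside L0
    # Data that ends up at level k is rewritten once per promotion: k times.
    promotions = sum(level * size for level, size in enumerate(level_sizes_gb))
    return gc_pass + stcs + promotions


def bytes_rewritten_in_place(level_sizes_gb):
    """If output stays in its source level, each SSTable is rewritten once."""
    return sum(level_sizes_gb)


# Hypothetical table with a fanout of 10: L1 = 1 GB ... L4 = 1000 GB.
sizes = [0, 1, 10, 100, 1000]
via_l0 = bytes_rewritten_via_l0(sizes)
in_place = bytes_rewritten_in_place(sizes)
```

With these made-up sizes the L0 route writes several times more data than a single in-place rewrite, and the gap grows with the number of levels.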
Due to the above, 'nodetool garbagecollect' is close to worthless for large LCS tables. A full compaction always has less write amplification and requires a similar amount of temporary disk space. The only exception is if you run 'nodetool garbagecollect' part-way and then use 'nodetool stop' to cancel it before L0 grows too large. In that case, if you are lucky and the order in which it processes SSTables happens to start with the ones that have the most disk space to reclaim, you might free up enough disk space to achieve your original goal.
However, from what I can tell, there is no good reason to move the output to L0. Leaving the output table in the same SSTableLevel as the source table does not violate any of the LeveledCompactionStrategy placement rules, as the output by definition has a token range equal to or smaller than the source.
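The placement argument can be checked with a small model. This is a sketch, not Cassandra code: SSTables are reduced to inclusive (first_token, last_token) pairs, and `level_is_valid` and `replace_in_place` are hypothetical helper names.

```python
def level_is_valid(level):
    """LCS rule for L1 and above: no two SSTables in a level overlap."""
    ordered = sorted(level)
    return all(prev[1] < nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def replace_in_place(level, source, output):
    """Swap `source` for its garbage-collected `output` in the same level.

    The output only drops data, so its token range must lie within the
    source's range.
    """
    if not (source[0] <= output[0] and output[1] <= source[1]):
        raise ValueError("output range must be within the source range")
    return [output if sstable == source else sstable for sstable in level]


# Hypothetical level with three non-overlapping SSTables.
level = [(0, 99), (100, 199), (200, 299)]
after_gc = replace_in_place(level, source=(100, 199), output=(120, 180))
```

Because the replacement range can only shrink, a level that was non-overlapping before the swap stays non-overlapping after it.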
The only drawback is if the size of the output files is significantly smaller than the source, in which case the source level would be under-sized. But that seems like a problem that LCS has to handle, not garbagecollect.
LCS could have a "pull up" operation where it does something like the following. Assume a table has L4 as the max level, and L3 and L4 are both 'under-sized'. L3 can attempt to 'pull up' any tables from L4 that do not overlap with the token ranges of the L3 tables. After that, it can choose to do some compactions that mix L3 and L4 to pull up data into L3 if it is still significantly under-sized.
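The first step of the "pull up" idea could look something like the sketch below. It assumes SSTables are modelled as inclusive (first_token, last_token) pairs; `pull_up_candidates` is a hypothetical name, not an existing Cassandra method.

```python
def overlaps(a, b):
    """True if two inclusive token ranges share at least one token."""
    return a[0] <= b[1] and b[0] <= a[1]


def pull_up_candidates(upper_level, lower_level):
    """SSTables in the lower level that overlap nothing in the upper level.

    These can be moved up without rewriting. They come from a single level,
    so they do not overlap each other either.
    """
    return [
        sstable
        for sstable in lower_level
        if not any(overlaps(sstable, other) for other in upper_level)
    ]


# Hypothetical under-sized L3 and the L4 beneath it.
l3 = [(0, 99), (300, 399)]
l4 = [(0, 49), (50, 149), (150, 249), (250, 349), (400, 499)]
movable = pull_up_candidates(l3, l4)
```

Here only the L4 SSTables covering 150-249 and 400-499 can move up for free. The rest would need the mixed L3/L4 compactions described above.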
From what I can tell, garbagecollect should just re-write tables in place, and leave the compaction strategy to deal with any consequences.
Moving to L0 is a bad idea. In addition to the extra write amplification and extreme increase in temporary disk space required, I observed the following:
A 'nodetool garbagecollect' was placing a lot of pressure on the L0 of a node. We stopped it about 20% of the way through the process, and it managed to compact down the top couple of levels. So we tried to run 'garbagecollect' again, but the first tables it chose to operate on were in L1, not the 'leaves' in L5! This was because the order in which SSTables are chosen currently does not consider the level, and instead looks purely at the max timestamp in the file. But because the prior garbagecollect had moved very old data from L5 into L0, many tables in L1 and L2 now had very wide ranges between their min and max timestamps: essentially some of the oldest and newest data all in one table. This breaks the usual structure of an LCS table, where the oldest data is at the highest levels.
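The timestamp effect can be illustrated with a minimal model. The `SSTableStats` class and `merge` helper are hypothetical and only track the level and the min/max timestamps. The timestamps are made-up numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTableStats:
    level: int
    min_ts: int
    max_ts: int


def merge(target_level, *tables):
    """Stats of a compaction output: it spans the timestamps of all inputs."""
    return SSTableStats(
        target_level,
        min(t.min_ts for t in tables),
        max(t.max_ts for t in tables),
    )


old_leaf = SSTableStats(level=5, min_ts=100, max_ts=200)   # old L5 data
fresh = SSTableStats(level=0, min_ts=900, max_ts=1000)     # recent writes

# After the old data was rewritten into L0, it gets compacted with fresh data.
mixed = merge(1, old_leaf, fresh)
span = mixed.max_ts - mixed.min_ts
```

The mixed L1 table has the same max timestamp as the fresh one, so an ordering based only on max timestamp cannot tell that it also holds the oldest data.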
I hope that others agree that this is a bug, and deserving of a fix.
I have a very simple patch for this that I will be creating a PR for soon. 3 lines for the code change, 70 lines for a new unit test.