Resolution: Won't Fix
Short version: it appears that if the resulting SSTable of a compaction enters another compaction soon after, the SSTables participating in the former compaction don't get deleted from disk until Cassandra is restarted.
We have run into a big problem after applying CASSANDRA-10276 and CASSANDRA-10280, backported to 2.0.14. The bug we're seeing was not introduced by these patches, though; they have just made it very apparent and harmful.
Here's what happened. We had repair running on a time-series table that uses DTCS. The ring was split into 5016 small ranges that were repaired one after the other (using parallel repair, i.e. not snapshot repair). This causes a flood of tiny SSTables to get streamed into all nodes (we don't use vnodes), with timestamp ranges similar to those of existing SSTables on disk. The problem with that is the sheer number of SSTables; disk usage is not affected. This has been reported before, see CASSANDRA-9644. These SSTables are streamed continuously for up to a couple of days.
The patches were applied to fix the problem of ending up with tens of thousands of SSTables that would never get touched by DTCS. But now that DTCS does touch them, we have run into a new problem instead. While disk usage was in the 25-30% neighborhood before repairs began, disk usage started growing fast when these continuous streams started coming in. Eventually, a couple of nodes ran out of disk, which led us to stop all the repairing on the cluster.
Stopping the repairs didn't reduce the disk usage. Compactions were of course very active. More than doubling disk usage should not be possible, regardless of the choices your compaction strategy makes. And we were not getting orders of magnitude more data streamed in. Large quantities of SSTables, yes, but this was the nodes creating more data out of thin air.
We have a tool to show timestamp and size metadata of SSTables. What we found, looking at all non-tmp data files, was something akin to duplicates of almost all the largest SSTables. Not quite exact replicas, but there were these multi-gigabyte SSTables covering exactly the same range of timestamps and differing in size by mere kilobytes. There were typically 3 of each of the largest SSTables, sometimes even more.
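The check described above can be sketched roughly as follows. This is not the tool we used, only a minimal illustration of the idea: group the large data files by their exact timestamp range and flag groups whose sizes differ only slightly. The `SSTableMeta` record, the function name and both thresholds are assumptions made for this sketch, and the metadata is assumed to have been extracted already by some other means.

```python
from collections import defaultdict
from typing import NamedTuple


class SSTableMeta(NamedTuple):
    """Hypothetical record holding the metadata a listing tool might report."""
    generation: int
    min_timestamp: int
    max_timestamp: int
    size_bytes: int


def find_near_duplicates(tables,
                         min_size_bytes=1_000_000_000,
                         max_size_diff_bytes=1_000_000):
    """Return groups of large SSTables that cover exactly the same
    timestamp range and differ in size by at most max_size_diff_bytes.

    Both thresholds are illustrative defaults, not values from the report.
    """
    by_range = defaultdict(list)
    for table in tables:
        # Only the largest SSTables are of interest here.
        if table.size_bytes >= min_size_bytes:
            by_range[(table.min_timestamp, table.max_timestamp)].append(table)

    groups = []
    for group in by_range.values():
        if len(group) < 2:
            continue
        sizes = [table.size_bytes for table in group]
        if max(sizes) - min(sizes) <= max_size_diff_bytes:
            # Sort by generation so the oldest file comes first.
            groups.append(sorted(group, key=lambda table: table.generation))
    return groups
```

Each returned group with more than one member is a candidate set of "duplicates"; if the suspicion in this report is right, only the highest generation in each group should still be on disk.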
Here's what I suspect: DTCS is the only compaction strategy that would commonly finish compacting a really large SSTable and, on the very next run of the compaction strategy, nominate the result for yet another compaction, even together with tiny SSTables, which certainly happens in our scenario. Potentially, the large SSTable that participated in the first compaction might even get nominated again by DTCS, if for some reason it can be returned by getUncompactingSSTables.
Whatever the reason, I have collected evidence showing that these large "duplicate" SSTables are of the same "lineage". Only one should remain on disk: the latest one. The older ones have already been compacted, resulting in the newer ones. But for some reason, they never got deleted from disk. And this was really harmful when combining DTCS with continuously streaming in tiny SSTables. The same but worse would happen without the patches and uncapped max_sstable_age_days.
Attached is one occurrence of 3 duplicated SSTables, their metadata and log lines about their compactions. You can see how similar they were to each other. SSTable generations 374277, 374249, 373702 (the large one), 374305, 374231 and 374333 completed compaction at 04:05:26,878, yet they were all still on disk over 6 hours later. At 04:05:26,898 the result, 374373, entered another compaction with 375174. They also stayed around after that compaction finished. Literally all SSTables named in these log lines were still on disk when I checked! Only one should have remained: 375189.
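To make the expectation explicit, here is a small sketch of the lineage reasoning applied to the generations named above. It assumes my reading of the log lines: the first compaction produced 374373, and the second one took 374373 and 375174 as inputs and produced 375189. The function name and the data layout are made up for this illustration.

```python
def live_generations(on_disk, compactions):
    """Return the generations that should still exist on disk.

    `compactions` is a list of (input_generations, output_generation)
    pairs. Every input of a finished compaction is obsolete, even if it
    was itself the output of an earlier compaction.
    """
    obsolete = set()
    for inputs, _output in compactions:
        obsolete.update(inputs)
    return set(on_disk) - obsolete


# Generations taken from the log excerpt described above.
compactions = [
    ({374277, 374249, 373702, 374305, 374231, 374333}, 374373),
    ({374373, 375174}, 375189),
]

# Everything named in those log lines was still on disk when I checked.
on_disk = {374277, 374249, 373702, 374305, 374231, 374333,
           374373, 375174, 375189}

live = live_generations(on_disk, compactions)
leftover = on_disk - live

print(sorted(live))   # [375189]
print(len(leftover))  # 8
```

The eight leftover generations are exactly the files that stayed on disk until the restart.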
Now this was just one random example from the data I collected. This happened everywhere. Some SSTables should probably have been deleted a day before.
However, once we restarted the nodes, all of the duplicates were suddenly gone!