I have found LCS to be superior to STCS in almost every way - except for the fact that it requires significantly more IO (a well advertised property). In leveled compaction, L n+1 is 10 times larger than L n so generally 1+10 sstables need to be compacted to promote one sstable into the next level. For certain workloads, this practically this means only 1/(10+1)=9% of the IO, specifically write IO, is doing ‘useful’ work.
But why is each level 10 times larger? Why 10? Its a pretty looking number and all but thats not a very good reason to choose it. If we chose 5 or even 2 we could reduce the ‘wasted’ io required to promote an sstable to the next level - of course at the expense of requiring more levels. I have not been able to find justification for this choice in either cassandra or leveldb itself. I would like to introduce a new parameter, the leveling multiplier, which controls the desired size difference between L n and L n+1.
First and foremost, a little math. Lets assume we have a CF of a fixed size that is receiving continuous new data (ie: data is expiring due to TTLs or is being overwritten). I believe the number of levels required is approximately (see note 1):
Which, when solving for the level count, becomes:
The amount of compaction write IO required over the lifetime of a particular piece of data (excluding compactions in L0) is:
So ultimately, the the relationship between write IO and the level multiplier is f(x) = (1 + x)/log(x) which is optimal at 3.59, or 4 if we round to the nearest integer. Also note that write IO is proportional to log((data size)/(sstable size)) which suggests using larger sstables would also reduce disk IO.
As one final analytical step we can add the following term to approximate STC in L0 (which is not actually how its implemented but should be close enough for moderate sstable sizes):
The following two graphs illustrate the predicted compaction requirements as a function of the leveling multiplier and sstable size:
In terms of empirically verifying the expected results, I set up three cassandra nodes, node A having a leveling multiplier of 10 and sstable size if 160 MB (current cassandra defaults), node B with multiplier 4 and size 160 MB, and node C with multiplier 4 and size 1024 MB. I used a simple write only workload which inserted data having a TTL of 2 days at 1 MB/second (see note 2). Compaction throttling was disabled and gc_grace was 60 seconds. All nodes had dedicated data disks and IO measurements were for the data disks only.
|Measure||Node A (10, 160MB)||Node B (4, 160MB)||Node C (4, 1024MB)|
|Predicted IO Rate||34.4 MB/s||26.2 MB/s||20.5 MB/s|
|Predicted Number of Levels (Expected Dataset of 169 GB)||3.0||5.0||3.7|
|Experimental IO Rate||32.0 MB/s||28.0 MB/s||20.4 MB/s|
|Experimental Number of Levels||~4.1||~6.1||~4.8|
|Final Dataset Size (After 88 hours)||301 GB||261 GB||258 GB|
These results indicate that Node A performed better than expected, I suspect that this was due to the fact that the data insertion rate was a little too high and compaction periodically got backlogged meaning the promotion from L0 to L1 was more efficient. Also note that the actual dataset size is larger than that used in the analytical model - which is expected as expired data will not get purged immediately. The size difference between node A and the others however seems suspicious to me.
In summary, these results, both theoretical and experimental, clearly indicate that reducing the level multiplier from 10 to 4 and increasing the sstable size reduces compaction IO. The experimental results, using an SSTable size of 1024 MB and level multiplier of 4, demonstrated a 36% reduction in write IO without a significant increase in the number of levels. I have not run benchmarks for an update heavy workload but I suspect it would benefit significantly since more data can be ‘updated’ per compaction. I have also not benchmarked read performance but I would not expect noticeable performance degradation provided an sstable size is chosen which keeps the number of levels roughly equal.
The patch I have attached is against 2.0.10 and does not change the defaults. Long term however, it would make sense to use more optimal defaults unless there is compelling counter evidence to the performance gains observed.
One final observation, in current leveled compaction the number of levels is determined by the amount of data and the user specified sstable size. A compaction strategy where instead the user selected the desired number of levels and the strategy adjusted the SSTable size based on the amount of data would have a number of benefits. The strategy would behave more consistently across a much wider range of dataset sizes. Compaction IO overhead (as a function of write rate) and worst case read performance (number of sstables per read) would both be largely independent of dataset size.
Note 1: This equation only calculates the amount of data able to fit in the largest level. It would be more accurate take into account data in smaller levels (ie: using the geometric series equation) but this is a close enough approximation. There is also the fact that redundant data might be spread across the various levels.
Note 2: This represents the entropy introduction rate and does not account for any Cassandra overhead but compression was also enabled. The row key was a long, each row had 512 columns, the column name was a UUID, and the column value was a 64 byte blob.