Details
Bug
Status: Open
Normal
Resolution: Unresolved
None
Normal
Description
Read tests get an unrealistic boost in performance because they read data from a set of partitions that was written sequentially.
I ran into this while running a timed read test against a large data set (250 million partition keys):
cassandra-stress read duration=30m
While the test was running, I noticed one node was performing zero IO after an initial period.
I discovered that each node in the cluster had blocks from only a single SSTable loaded in the FS cache:
vmtouch -v /path/to/sstables
For the node that was performing zero IO, the SSTable in question was small enough to fit into the FS cache.
I realized that when a read test is run for a duration or until rate convergence, the default population for the seeds is a GAUSSIAN distribution over the first million seeds. Because of the way compaction works, partitions that are written sequentially will (with high probability) always live in the same SSTable. So although the first million seeds generate partition keys that are randomly distributed in the token space, those partitions will most likely all live in the same SSTable. When this SSTable is small enough to fit into the FS cache, you get unbelievably good results for a read test. Consider that a dataset 4x the size of the FS cache will have almost half of its data in SSTables small enough to fit into the FS cache.
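To make the effect concrete, here is a minimal sketch of a toy model, not of cassandra-stress itself. It assumes that seeds were written in order and that each SSTable holds one contiguous, equal-sized range of seeds (SEEDS_PER_SSTABLE is a hypothetical figure; real SSTables vary in size). It also assumes the gaussian is centred on the seed range with a standard deviation of one sixth of the range, which is my understanding of the default parameterisation.

```python
import random

TOTAL_SEEDS = 250_000_000        # seeds used to load the cluster (from the report)
SEEDS_PER_SSTABLE = 10_000_000   # hypothetical: equal-sized, contiguous seed ranges


def sstable_for(seed):
    """Index of the modelled SSTable holding the partition written for this seed."""
    return (seed - 1) // SEEDS_PER_SSTABLE


def gaussian_seed(rng, lo, hi, stdvrng=3.0):
    """Draw a seed from a gaussian centred on [lo, hi], clamped to that range."""
    mean = (lo + hi) / 2
    stdev = (mean - lo) / stdvrng  # assumed parameterisation
    return min(hi, max(lo, round(rng.gauss(mean, stdev))))


def sstables_touched(seeds):
    """Set of modelled SSTable indexes that a sequence of reads would hit."""
    return {sstable_for(s) for s in seeds}


rng = random.Random(42)
# Default read population: gaussian over the first million seeds.
default_reads = [gaussian_seed(rng, 1, 1_000_000) for _ in range(10_000)]
# For comparison: uniform reads over every seed that was written.
uniform_reads = [rng.randint(1, TOTAL_SEEDS) for _ in range(10_000)]

print("default population touches", len(sstables_touched(default_reads)), "SSTable(s)")
print("uniform population touches", len(sstables_touched(uniform_reads)), "SSTable(s)")
```

In this model every read from the default population lands in the first SSTable, while uniform reads over the full seed range spread across all of them.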
Adjusting the population of seeds used during the read test to be the entire 250 million seeds used to load the cluster does not fix the problem.
cassandra-stress read duration=30m -pop dist=gaussian(1..250M)
or (same population, larger sample):
cassandra-stress read n=250M
Any distribution other than the uniform distribution has one or more modes, and the mode(s) of such a distribution will cluster reads around a certain seed range. That seed range corresponds to a certain set of sequential writes, which in turn corresponds (with high probability) to a single SSTable.
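A rough calculation shows how strong this clustering is for the gaussian case. The sketch below assumes the standard deviation is one sixth of the seed range (an assumption about the default gaussian parameters, not something confirmed here) and uses the standard normal identity P(|Z| < k) = erf(k / sqrt(2)).

```python
from math import erf, sqrt


def fraction_within(k_stdevs):
    """Probability that a normal draw lands within k standard deviations of the mean."""
    return erf(k_stdevs / sqrt(2))


lo, hi = 1, 250_000_000
mean = (lo + hi) / 2
stdev = (mean - lo) / 3  # assumed: the range spans three stdevs either side of the mean

# Seed window within one standard deviation of the mode.
window = (round(mean - stdev), round(mean + stdev))
share_of_seeds = (window[1] - window[0]) / (hi - lo)
share_of_reads = fraction_within(1)

print(f"seeds {window[0]}..{window[1]}: "
      f"{share_of_seeds:.1%} of the seeds receive {share_of_reads:.1%} of the reads")
```

Under these assumptions roughly two thirds of all reads go to the middle third of the seed range, so the partitions written in that interval are read far more often than the rest.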
My patch against cassandra-3.11 fixes this by shuffling the sequence of generated seeds. Each seed value will still be generated once and only once. The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) may be selected by using the no-shuffle flag, e.g.:
cassandra-stress read duration=30m -pop no-shuffle
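For illustration only, one simple way to visit every seed exactly once in a non-sequential order is an affine permutation. This is a minimal sketch and not necessarily how the patch shuffles seeds; make_seed_shuffle, multiplier and offset are hypothetical names and values chosen for the example. The map i -> (a * i + b) mod n is a bijection on [0, n) whenever a and n are coprime.

```python
from math import gcd


def make_seed_shuffle(n, multiplier=2_654_435_761, offset=12_345):
    """Return a function mapping each index in [0, n) to a distinct seed in [1, n]."""
    if n < 1:
        raise ValueError("n must be positive")
    a = multiplier % n
    # Bump the multiplier until it is coprime with n, which makes the map a bijection.
    while gcd(a, n) != 1:
        a += 1

    def shuffled_seed(i):
        if not 0 <= i < n:
            raise ValueError("index out of range")
        return (a * i + offset) % n + 1

    return shuffled_seed


n = 1_000
shuffled_seed = make_seed_shuffle(n)
seeds = [shuffled_seed(i) for i in range(n)]

print(seeds[:5])                                # consecutive indexes, scattered seeds
print(sorted(seeds) == list(range(1, n + 1)))   # every seed appears exactly once
```

An affine map keeps no state and needs no lookup table, which matters at 250 million seeds, but it is only weakly random. A stronger shuffle would need a different construction.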
Results: In vmtouch-before.txt, only pages from a single SSTable are present in the FS cache, while in vmtouch-after.txt an equal proportion of every SSTable is present in the FS cache.