  1. Cassandra
  2. CASSANDRA-13932

Stress write order and seed order should be different

Details

    • Normal

    Description

      Read tests get an unrealistic boost in performance because they read data from a set of partitions that was written sequentially.

      I ran into this while running a timed read test against a large data set (250 million partition keys):

      cassandra-stress read duration=30m

      While the test was running, I noticed one node was performing zero IO after an initial period.

      I discovered each node in the cluster only had blocks from a single SSTable loaded in the FS cache.

      vmtouch -v /path/to/sstables

      For the node that was performing zero IO, the SSTable in question was small enough to fit into the FS cache.

      I realized that when a read test is run for a duration or until rate convergence, the default population for the seeds is a GAUSSIAN distribution over the first million seeds. Because of the way compaction works, partitions that are written sequentially will (with high probability) always live in the same SSTable. That means that while the first million seeds will generate partition keys that are randomly distributed in the token space, they will most likely all live in the same SSTable. When this SSTable is small enough to fit into the FS cache, you get unrealistically good results for a read test. Consider that a dataset 4x the size of the FS cache will have almost 1/2 the data in SSTables small enough to fit into the FS cache.
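
      A minimal sketch of the effect described above, not a model of real compaction: it assumes 25 hypothetical equal-sized SSTables filled strictly in seed order, and a Gaussian whose range spans three standard deviations either side of the mean (an assumption about the default parameters). The names `NUM_SSTABLES`, `sstable_for_seed` and `sample_gaussian_seed` are made up for illustration.

```python
import random
from collections import Counter

TOTAL_SEEDS = 250_000_000      # partitions written, as in the report
NUM_SSTABLES = 25              # hypothetical: equal-sized SSTables, written in seed order
SEEDS_PER_SSTABLE = TOTAL_SEEDS // NUM_SSTABLES


def sstable_for_seed(seed):
    """Index of the (hypothetical) SSTable holding a seed, assuming sequential writes."""
    return (seed - 1) // SEEDS_PER_SSTABLE


def sample_gaussian_seed(rng, low, high):
    """Draw a seed from a Gaussian centred on [low, high], clamped to that range."""
    mean = (low + high) / 2
    stdev = (high - low) / 6   # assumption: range covers +/- 3 standard deviations
    value = round(rng.gauss(mean, stdev))
    return min(max(value, low), high)


rng = random.Random(42)
# Read population limited to the first million seeds, as in the default case.
reads = Counter(
    sstable_for_seed(sample_gaussian_seed(rng, 1, 1_000_000))
    for _ in range(100_000)
)
print(reads)  # Counter({0: 100000})
```

      Under these assumptions every read lands in SSTable 0, even though the partition keys derived from those seeds are spread across the token space.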

      Adjusting the population of seeds used during the read test to be the entire 250 million seeds used to load the cluster does not fix the problem.

      cassandra-stress read duration=30m -pop dist=gaussian(1..250M)

      or (same population, larger sample)

      cassandra-stress read n=250M

      Any distribution other than the uniform distribution has one or more modes, and the mode(s) of such a distribution will cluster reads around a certain seed range which corresponds to a certain set of sequential writes which corresponds to (with high probability) a single SSTable.
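
      The clustering around a mode can be sketched with the same kind of toy model. The sketch below assumes 25 hypothetical equal-sized SSTables written in seed order and a Gaussian with standard deviation equal to one sixth of the range (an assumption, not a confirmed default); it compares how many reads hit the five SSTables around the middle of the seed range under a Gaussian and a uniform population.

```python
import random

TOTAL_SEEDS = 250_000_000
NUM_SSTABLES = 25                      # hypothetical, equal-sized, written in seed order
SEEDS_PER_SSTABLE = TOTAL_SEEDS // NUM_SSTABLES
CENTRAL = range(10, 15)                # the 5 SSTables around the middle of the seed range
SAMPLES = 100_000


def sstable_for_seed(seed):
    """Index of the (hypothetical) SSTable holding a seed, assuming sequential writes."""
    return (seed - 1) // SEEDS_PER_SSTABLE


def gaussian_seed(rng):
    """Gaussian over 1..TOTAL_SEEDS, clamped; stdev = range / 6 is an assumption."""
    mean = (1 + TOTAL_SEEDS) / 2
    stdev = (TOTAL_SEEDS - 1) / 6
    return min(max(round(rng.gauss(mean, stdev)), 1), TOTAL_SEEDS)


def uniform_seed(rng):
    """Uniform over 1..TOTAL_SEEDS."""
    return rng.randint(1, TOTAL_SEEDS)


def central_share(draw, rng):
    """Fraction of sampled reads that land in the central SSTables."""
    hits = sum(sstable_for_seed(draw(rng)) in CENTRAL for _ in range(SAMPLES))
    return hits / SAMPLES


rng = random.Random(7)
gaussian_share = central_share(gaussian_seed, rng)
uniform_share = central_share(uniform_seed, rng)
print(f"gaussian: {gaussian_share:.2f}, uniform: {uniform_share:.2f}")
```

      In this model the central fifth of the SSTables should receive roughly 45% of Gaussian reads versus roughly 20% of uniform reads, so the pages around the mode stay hot in the FS cache.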

      My patch against cassandra-3.11 fixes this by shuffling the sequence of generated seeds. Each seed value is still generated exactly once. The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) can be selected with the no-shuffle flag, e.g.

      cassandra-stress read duration=30m -pop no-shuffle
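
      One way to visit every seed exactly once in a shuffled order, without materializing a 250-million-element list, is a keyed permutation. The sketch below is not taken from the attached patch; it swaps in a generic technique (a small Feistel network with cycle-walking), and `shuffled_seed`, `key` and `rounds` are hypothetical names chosen for illustration.

```python
import hashlib


def _round_value(value, round_index, key, half_bits):
    """Keyed pseudo-random round function returning half_bits bits."""
    digest = hashlib.sha256(f"{key}:{round_index}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << half_bits) - 1)


def shuffled_seed(index, total, key=0, rounds=4):
    """Map index in [0, total) to a unique value in [0, total)."""
    if not 0 <= index < total:
        raise ValueError("index out of range")
    # Half-width of the smallest even-bit domain that covers [0, total).
    half_bits = max(1, ((total - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1
    value = index
    while True:
        left, right = value >> half_bits, value & mask
        for r in range(rounds):
            # Each Feistel round is invertible, so the whole map is a bijection.
            left, right = right, left ^ _round_value(right, r, key, half_bits)
        value = (left << half_bits) | right
        if value < total:  # cycle-walk until the result is back inside the domain
            return value


# Usage: a shuffled ordering of 1000 seeds, each produced exactly once.
order = [shuffled_seed(i, 1000) for i in range(1000)]
print(order[:5])
```

      Because the mapping is a bijection on [0, total), consecutive indices map to scattered seeds while the set of generated seeds is unchanged; add 1 to the result if seeds are numbered from 1.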

      Results: In vmtouch-before.txt, only pages from a single SSTable are present in the FS cache, while in vmtouch-after.txt an equal proportion of each SSTable is present in the FS cache.

      Attachments

        1. vmtouch-before.txt
          3 kB
          Daniel Cranford
        2. vmtouch-after.txt
          0.8 kB
          Daniel Cranford
        3. 0001-Initial-implementation-cassandra-3.11.patch
          14 kB
          Daniel Cranford

          People

            Unassigned Unassigned
            daniel.cranford Daniel Cranford