Apache Cassandra / CASSANDRA-18118

Do not leak 2015 memtable synthetic Epoch


Details

    Description

      The synthetic 2015 Epoch can leak out of the memtable and affect any logic that consumes timestamps. In a production environment it has been observed to prevent, for example, proper sstable and tombstone cleanup.

      To reproduce, create the following table:

      DROP KEYSPACE IF EXISTS test;
      create keyspace test WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};
      CREATE TABLE test.test (
          key text PRIMARY KEY,
          id text
      ) WITH bloom_filter_fp_chance = 0.01
          AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
          AND comment = ''
          AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '2', 'tombstone_compaction_interval': '3000', 'tombstone_threshold': '0.1', 'unchecked_tombstone_compaction': 'true'}
          AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
          AND crc_check_chance = 1.0
          AND dclocal_read_repair_chance = 0.0
          AND default_time_to_live = 10
          AND gc_grace_seconds = 10
          AND max_index_interval = 2048
          AND memtable_flush_period_in_ms = 0
          AND min_index_interval = 128
          AND read_repair_chance = 0.0
          AND speculative_retry = '99PERCENTILE';
      
      CREATE INDEX id_idx ON test.test (id);
      

      And stress load it with:

      insert into test.test (key,id) values('$RANDOM_UUID $RANDOM_UUID', 'eaca36a1-45f1-469c-a3f6-3ba54220363f') USING TTL 10
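
One way to generate the statement stream for the load, sketched in Python. It assumes that each `$RANDOM_UUID` placeholder stands for a freshly generated random UUID. `make_insert` and `statements` are hypothetical helper names, and sending the output to the cluster (through cqlsh or a driver) is deliberately left out.

```python
import uuid

# Fixed indexed value, copied from the statement above.
INDEXED_ID = "eaca36a1-45f1-469c-a3f6-3ba54220363f"


def make_insert(ttl_seconds: int = 10) -> str:
    """Build one INSERT whose key is made of two fresh random UUIDs."""
    key = f"{uuid.uuid4()} {uuid.uuid4()}"
    return (
        "insert into test.test (key,id) "
        f"values('{key}', '{INDEXED_ID}') USING TTL {ttl_seconds};"
    )


def statements(count: int):
    """Yield `count` INSERT statements, each with a distinct key."""
    for _ in range(count):
        yield make_insert()


if __name__ == "__main__":
    for stmt in statements(5):
        print(stmt)
```

Every row gets a new partition key but the same `id`, so all rows land under a single value in `id_idx`.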
      

      Note that every insert uses a 10s TTL, and that the table's default TTL and gc_grace_seconds are also 10s. This is to speed up the repro:

      • Run the load for a couple of minutes and track sstable disk usage. It only ever increases: nothing gets cleaned up and it doesn't stop growing, even though all of this is well past the 10s gc_grace and TTL.
      • Running a flush and a compaction while under load against the keyspace, table or index doesn't solve the issue.
      • Stopping the load and running a compaction doesn't solve the issue. Flushing does though.
      • In the original observation, where TTL was around 600s and gc_grace around 1800s, GBs of sstables were not cleaned up or compacted away after hours of work.
      • The problem can also be reproduced on plain sstables, with no secondary (2i) indexes or TTL involved, by repeatedly inserting, deleting and overwriting the same values.
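
Tracking disk usage during the repro can be done with a small polling script such as the sketch below. The table directory path is an assumption that depends on the node's data directory layout, and `sstable_bytes` and `watch` are hypothetical helper names. The walk counts everything beneath the directory, including any subdirectories.

```python
import os
import time


def sstable_bytes(table_dir: str) -> int:
    """Total size in bytes of all files found under table_dir."""
    total = 0
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                # A file can disappear mid-walk, e.g. when a compaction finishes.
                pass
    return total


def watch(table_dir: str, samples: int, interval_s: float = 10.0):
    """Return a list of (elapsed_seconds, total_bytes) readings."""
    start = time.monotonic()
    readings = []
    for i in range(samples):
        readings.append((time.monotonic() - start, sstable_bytes(table_dir)))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings
```

With a 10s TTL and gc_grace, a healthy table should show the readings level off or drop after compactions. In the scenario described here they keep climbing.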

      The problem seems to be that EncodingStats uses a synthetic Epoch in 2015, which plays nicely with vint serialization. Unfortunately, Memtable uses those stats to keep track of its minTimestamp, so the 2015 Epoch can leak into that value. This confuses any logic consuming that timestamp. In this particular case, purgeable data and fully expired sstables weren't properly detected.
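
The effect can be illustrated with a simplified model. This is not Cassandra's code: `Stats`, `LeakyMemtable` and `can_purge` are hypothetical names, the purge rule is an assumed approximation of "only drop data older than anything the memtable might hold", and the synthetic epoch is arbitrarily placed at 2015-01-01 UTC because the report only gives the year.

```python
from datetime import datetime, timezone


def to_micros(dt: datetime) -> int:
    """Convert an aware datetime to microseconds since the Unix epoch."""
    return int(dt.timestamp() * 1_000_000)


# Assumption: some instant in 2015, chosen arbitrarily for the illustration.
SYNTHETIC_EPOCH_US = to_micros(datetime(2015, 1, 1, tzinfo=timezone.utc))
# Fixed "current" write time so the example is deterministic.
NOW_US = to_micros(datetime(2022, 12, 1, tzinfo=timezone.utc))


class Stats:
    """Hypothetical per-update stats; with no data they fall back to the synthetic epoch."""

    def __init__(self, min_timestamp_us=None):
        self.min_timestamp_us = (
            SYNTHETIC_EPOCH_US if min_timestamp_us is None else min_timestamp_us
        )


class LeakyMemtable:
    """Tracks its minimum timestamp by folding in every Stats it is given."""

    def __init__(self):
        self.min_timestamp_us = 2**63 - 1  # "nothing seen yet"

    def apply(self, stats: Stats) -> None:
        self.min_timestamp_us = min(self.min_timestamp_us, stats.min_timestamp_us)


def can_purge(tombstone_ts_us: int, memtable: LeakyMemtable) -> bool:
    """Assumed rule: purge only if the tombstone predates everything in the memtable."""
    return tombstone_ts_us < memtable.min_timestamp_us


if __name__ == "__main__":
    memtable = LeakyMemtable()
    memtable.apply(Stats(NOW_US))  # a real write
    memtable.apply(Stats())        # an update carrying only default stats
    tombstone = NOW_US - 60_000_000  # a tombstone written one minute earlier
    print(memtable.min_timestamp_us == SYNTHETIC_EPOCH_US)
    print(can_purge(tombstone, memtable))
```

In this model a single update with default stats drags the tracked minimum back to 2015, so no tombstone written after that date ever looks purgeable while the memtable is live. That is consistent with the observation that flushing after stopping the load clears the problem.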

          People

            bereng Berenguer Blasi
            bereng Berenguer Blasi
            Berenguer Blasi
            Caleb Rackliffe