  1. Cassandra
  2. CASSANDRA-16226

COMPACT STORAGE SSTables created before 3.0 are not correctly skipped by timestamp due to missing primary key liveness info

    Details

    • Bug Category:
      Degradation - Performance Bug/Regression
    • Severity:
      Critical
    • Complexity:
      Normal
    • Discovered By:
      User Report
    • Platform:
      All
    • Impacts:
      None
    • Since Version:
    • Test and Documentation Plan:
      The patch includes a series of new tests in SSTablesIteratedTest that verify expected numbers of SSTables read for several query types across compact and non-compact tables. They should serve as reasonable documentation and guardrails against further regression.

      The official docs on compact storage and the in-tree docs (in ddl.rst) will need some rework as well, both to indicate that it will live on in 4.0, and to take into account the concerns in this issue.

      Description

      This was discovered while tracking down a spike in the number of SSTables per read for a COMPACT STORAGE table after a 2.1 -> 3.0 upgrade. Before 3.0, there is no direct analog of 3.0's primary key liveness info. When we upgrade 2.1 COMPACT STORAGE SSTables to the mf format, we simply don't write row timestamps, even if the original mutations were INSERTs. On read, when we look at SSTables in order from newest to oldest max timestamp, we expect to have this primary key liveness information to determine whether we can skip older SSTables after finding completely populated rows.

      For example, suppose I have three SSTables in a COMPACT STORAGE table with max timestamps 1000, 2000, and 3000. There are many rows in a particular partition, making filtering on the min and max clustering effectively a no-op. All data is inserted, and there are no partial updates. A fully specified row with timestamp 2500 exists in the SSTable with a max timestamp of 3000. With a proper row timestamp in hand, we can safely skip the SSTables with max timestamps of 1000 and 2000, since nothing in them can be newer than the row we already have. Without it, we read three SSTables instead of one, which likely means a significant performance regression.
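
      The scenario above can be sketched as a small, self-contained simulation. This is not Cassandra's read path: the class SSTableSkipSketch, its Row and SSTable types, and the canSkipOlder() helper are hypothetical names made up for illustration, and the merge step is reduced to "keep the newest copy of the row". It only shows how a missing primary key liveness timestamp turns one SSTable read into three.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of newest-to-oldest SSTable iteration; not Cassandra code.
class SSTableSkipSketch {
    // Stand-in for "no primary key liveness info was written".
    static final long NO_TIMESTAMP = Long.MIN_VALUE;

    static final class Row {
        final long pkLivenessTimestamp;
        final long[] cellTimestamps;

        Row(long pkLivenessTimestamp, long... cellTimestamps) {
            this.pkLivenessTimestamp = pkLivenessTimestamp;
            this.cellTimestamps = cellTimestamps;
        }
    }

    static final class SSTable {
        final long maxTimestamp;
        final Row row; // the queried row, or null if this SSTable does not contain it

        SSTable(long maxTimestamp, Row row) {
            this.maxTimestamp = maxTimestamp;
            this.row = row;
        }
    }

    // True only if everything we hold is strictly newer than anything the older SSTable could contain.
    static boolean canSkipOlder(Row row, long olderMaxTimestamp) {
        if (row.pkLivenessTimestamp == NO_TIMESTAMP || row.pkLivenessTimestamp <= olderMaxTimestamp)
            return false;
        for (long ts : row.cellTimestamps)
            if (ts <= olderMaxTimestamp)
                return false;
        return true;
    }

    // Counts how many SSTables are read when visiting them from newest to oldest max timestamp.
    static int sstablesRead(List<SSTable> sstables) {
        List<SSTable> ordered = new ArrayList<>(sstables);
        ordered.sort(Comparator.comparingLong((SSTable s) -> s.maxTimestamp).reversed());

        Row result = null;
        int read = 0;
        for (SSTable sstable : ordered) {
            if (result != null && canSkipOlder(result, sstable.maxTimestamp))
                break; // every remaining SSTable is older still
            read++;
            if (result == null)
                result = sstable.row; // simplification: keep the newest copy, no real merge
        }
        return read;
    }

    // The three SSTables from the example; only the newest holds the fully specified row.
    static List<SSTable> example(long pkLivenessTimestamp) {
        return Arrays.asList(
            new SSTable(1000L, null),
            new SSTable(2000L, null),
            new SSTable(3000L, new Row(pkLivenessTimestamp, 2500L, 2500L)));
    }

    public static void main(String[] args) {
        System.out.println(sstablesRead(example(2500L)));        // prints 1
        System.out.println(sstablesRead(example(NO_TIMESTAMP))); // prints 3
    }
}
```

      The only difference between the two calls is whether the row carries a liveness timestamp, which mirrors the difference between a row written natively in 3.0 and one upgraded from a 2.1 COMPACT STORAGE SSTable.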

      The following test illustrates this difference in behavior between 2.1 and 3.0:
      https://github.com/maedhroz/cassandra/commit/84ce9242bedd735ca79d4f06007d127de6a82800

      A solution here might be as simple as having SinglePartitionReadCommand#canRemoveRow() only inspect primary key liveness information for non-compact/CQL tables. Tombstones seem to be handled at a level above that anyway. (One potential problem with that is whether or not the distinction will continue to exist in 4.0, and dropping compact storage from a table doesn't magically make pk liveness information appear.)
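
      A minimal sketch of that idea follows, assuming a simplified row representation. The signature below is hypothetical and does not match the real SinglePartitionReadCommand#canRemoveRow(); the isCompactTable flag and the NO_TIMESTAMP sentinel are assumptions made for illustration, and the cell timestamp array is assumed to hold one entry for every requested column.

```java
// Hypothetical illustration of the proposed check; not the actual Cassandra method.
class CompactAwareSkipSketch {
    // Stand-in for "no primary key liveness info was written".
    static final long NO_TIMESTAMP = Long.MIN_VALUE;

    static boolean canRemoveRow(long pkLivenessTimestamp,
                                long[] cellTimestamps,
                                long sstableMaxTimestamp,
                                boolean isCompactTable) {
        // Only non-compact (CQL) tables are expected to carry primary key liveness info.
        if (!isCompactTable) {
            if (pkLivenessTimestamp == NO_TIMESTAMP || pkLivenessTimestamp <= sstableMaxTimestamp)
                return false;
        }
        // In both cases, every requested cell must be newer than anything in the older SSTable.
        for (long ts : cellTimestamps)
            if (ts <= sstableMaxTimestamp)
                return false;
        return true;
    }

    public static void main(String[] args) {
        long[] fullRow = {2500L, 2500L};
        System.out.println(canRemoveRow(NO_TIMESTAMP, fullRow, 2000L, true));  // prints true
        System.out.println(canRemoveRow(NO_TIMESTAMP, fullRow, 2000L, false)); // prints false
    }
}
```

      As the parenthetical above notes, a table that has had compact storage dropped would take the non-compact branch in a check like this while still lacking liveness info in its old SSTables, so the flag alone does not cover that case.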

              People

              • Assignee:
                maedhroz Caleb Rackliffe
                Reporter:
                maedhroz Caleb Rackliffe
                Authors:
                Caleb Rackliffe, Michael Semb Wever
                Reviewers:
                Alex Petrov, Michael Semb Wever
              • Votes:
                0
                Watchers:
                9

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Time Tracking

                  Estimated:
                  Not Specified
                  Remaining:
                  0h
                  Logged:
                  5h 10m