Cassandra
  1. Cassandra
  2. CASSANDRA-2498

Improve read performance in update-intensive workload

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Minor Minor
    • Resolution: Fixed
    • Fix Version/s: 1.0.0
    • Component/s: Core
    • Labels:

      Description

      Read performance in an update-heavy environment relies heavily on compaction to maintain good throughput. (This is not the case for workloads where rows are only inserted once, because the bloom filter keeps us from having to check sstables unnecessarily.)

      Very early versions of Cassandra attempted to mitigate this by checking sstables in descending generation order (mostly equivalent to descending mtime): once all the requested columns were found, it would not check any older sstables.

      This was incorrect, because data timestamp will not correspond to sstable timestamp, both because compaction has the side effect of "refreshing" data to a newer sstable, and because hintead handoff may send us data older than what we already have.

      Instead, we could create a per-sstable piece of metadata containing the most recent (client-specified) timestamp for any column in the sstable. We could then sort sstables by this timestamp instead, and perform a similar optimization (if the remaining sstable client-timestamps are older than the oldest column found in the desired result set so far, we don't need to look further). Since under almost every workload, client timestamps of data in a given sstable will tend to be similar, we expect this to cut the number of sstables down proportionally to how frequently each column in the row is updated. (If each column is updated with each write, we only have to check a single sstable.)

      This may also be useful information when deciding which SSTables to compact.

      (Note that this optimization is only appropriate for named-column queries, not slice queries, since we don't know what non-overlapping columns may exist in older sstables.)

      1. supersede-name-filter-collations.patch
        14 kB
        Daniel Doubleday
      2. 2498-v4.txt
        25 kB
        Jonathan Ellis
      3. 2498-v3.txt
        22 kB
        Jonathan Ellis
      4. 2498-v2.txt
        15 kB
        Jonathan Ellis

        Issue Links

          Activity

          Hide
          Peter Schuller added a comment -

          Sounds good; a trade-off to keep in mind is that it optimizes for throughput (number of sstables touched) at the cost of latency (because of the inability to submit read requests concurrently since they become dependent). Probably makes sense for most workloads.

          Show
          Peter Schuller added a comment - Sounds good; a trade-off to keep in mind is that it optimizes for throughput (number of sstables touched) at the cost of latency (because of the inability to submit read requests concurrently since they become dependent). Probably makes sense for most workloads.
          Hide
          Jonathan Ellis added a comment -

          We've always been pretty up front about optimizing for throughput over latency, so this is nothing new... But I think this wins on latency, too (as long as our architecture remains single-thread per read request), even on a best-case system where i/o is free, since deserializing + merging takes CPU too.

          Show
          Jonathan Ellis added a comment - We've always been pretty up front about optimizing for throughput over latency, so this is nothing new... But I think this wins on latency, too (as long as our architecture remains single-thread per read request), even on a best-case system where i/o is free, since deserializing + merging takes CPU too.
          Hide
          Ryan King added a comment -

          In addition to update-heavy operations. Column Families with wide rows need some love on the latency side too.

          Show
          Ryan King added a comment - In addition to update-heavy operations. Column Families with wide rows need some love on the latency side too.
          Hide
          Stu Hood added a comment - - edited

          Linking to 1608, which proposed some related approaches:

          1. Partitioning by key/column so that the data size per partition is lower and requires less sstables
          2. Implementing Lucene-like deletes/supersedes, which would be written in newer files to prevent reads to older files at a key/block level

          EDIT: I think this ticket actually proposes most of the "supersede" idea at the column level. Additionally, given range metadata (as described on http://wiki.apache.org/cassandra/FileFormatDesignDoc), you could place a range tombstone in a newer file, which could prevent reads to older files. Once those primitives exist, there are very interesting optimizations available: I've described one on CASSANDRA-2503.

          Show
          Stu Hood added a comment - - edited Linking to 1608, which proposed some related approaches: Partitioning by key/column so that the data size per partition is lower and requires less sstables Implementing Lucene-like deletes/supersedes, which would be written in newer files to prevent reads to older files at a key/block level EDIT: I think this ticket actually proposes most of the "supersede" idea at the column level. Additionally, given range metadata (as described on http://wiki.apache.org/cassandra/FileFormatDesignDoc ), you could place a range tombstone in a newer file, which could prevent reads to older files. Once those primitives exist, there are very interesting optimizations available: I've described one on CASSANDRA-2503 .
          Hide
          Jonathan Ellis added a comment -

          We could add this metadata into the -statistics component instead of adding a new one. (Just default it to zero on old-version sstables.)

          Show
          Jonathan Ellis added a comment - We could add this metadata into the -statistics component instead of adding a new one. (Just default it to zero on old-version sstables.)
          Hide
          Alan Liang added a comment -

          Other tickets require this functionality. This attempts to avoid duplicate efforts.

          Show
          Alan Liang added a comment - Other tickets require this functionality. This attempts to avoid duplicate efforts.
          Hide
          Jonathan Ellis added a comment -

          Column Families with wide rows need some love on the latency side too.

          Yes, and (I think this is where you were going with this, but to be explicit) this will help with those too, IF the queries are for most-recent-data. Which is the most common kind of wide row from what I've seen.

          Show
          Jonathan Ellis added a comment - Column Families with wide rows need some love on the latency side too. Yes, and (I think this is where you were going with this, but to be explicit) this will help with those too, IF the queries are for most-recent-data. Which is the most common kind of wide row from what I've seen.
          Hide
          Jonathan Ellis added a comment -

          IF the queries are for most-recent-data

          Actually... I can't think of a good way for us to infer that (say) higher timeuuid comparators always correspond to higher column timestamps.

          Show
          Jonathan Ellis added a comment - IF the queries are for most-recent-data Actually... I can't think of a good way for us to infer that (say) higher timeuuid comparators always correspond to higher column timestamps.
          Hide
          Stu Hood added a comment -

          Actually... I can't think of a good way for us to infer that (say) higher timeuuid comparators always correspond to higher column timestamps.

          CASSANDRA-2319 handles append-only wide row cases with an index lookup... I'm not sure they're worth worrying about for this ticket.

          Usecases involving slicing really need range/slice metadata to apply this type of optimization.

          Show
          Stu Hood added a comment - Actually... I can't think of a good way for us to infer that (say) higher timeuuid comparators always correspond to higher column timestamps. CASSANDRA-2319 handles append-only wide row cases with an index lookup... I'm not sure they're worth worrying about for this ticket. Usecases involving slicing really need range/slice metadata to apply this type of optimization.
          Hide
          Jonathan Ellis added a comment -

          CASSANDRA-2319 handles append-only wide row cases with an index lookup

          But it still has to do the lookup-per-sstable, right?

          Usecases involving slicing really need range/slice metadata to apply this type of optimization

          Yes. It would be easy to add an option like that for CfDef here, though.

          Show
          Jonathan Ellis added a comment - CASSANDRA-2319 handles append-only wide row cases with an index lookup But it still has to do the lookup-per-sstable, right? Usecases involving slicing really need range/slice metadata to apply this type of optimization Yes. It would be easy to add an option like that for CfDef here, though.
          Hide
          Daniel Doubleday added a comment -

          Took a shot at this one.

          I saw 2 ways of doing this:

          • Implementing lazy versions of column iterators
          • Doing multiple collations while reducing the filter columns when one can be sure that the column will supersede

          Problem with second choice is that everything in the query filter code assumes that all column iterators are collated at once but since first choice seemed to be a lot of effort I tried multiple collations anyway.

          So that's the plan:

          • collect and collate mem tables
          • while there are columns in the filter that are not known to supersede any other iterate over sorted sstabled and remove all cols that supersede.
          • stop if no cols are left in the filter

          Everything takes place in a new CollationController.

          So far I found one ugly edge case that comes up with system table that have 0 grace time.

          If you guys think that's a worthwhile approach I'll provide tests. Standard test suite obviously runs.

          I guess it would be easy to extend that approach to slice filters and skinny rows when implementing CASSANDRA-2503. The only thing that would be needed is a superseding timestamp in the header of a row / CF same as DeletionInfo and probably a configuration option per CF. If too many sstable are read than one could compact that row and put it in memtable.

          Also with a little (or more) work it might be interesting to see if superseding range information could be stored on block level (row index).

          Show
          Daniel Doubleday added a comment - Took a shot at this one. I saw 2 ways of doing this: Implementing lazy versions of column iterators Doing multiple collations while reducing the filter columns when one can be sure that the column will supersede Problem with second choice is that everything in the query filter code assumes that all column iterators are collated at once but since first choice seemed to be a lot of effort I tried multiple collations anyway. So that's the plan: collect and collate mem tables while there are columns in the filter that are not known to supersede any other iterate over sorted sstabled and remove all cols that supersede. stop if no cols are left in the filter Everything takes place in a new CollationController. So far I found one ugly edge case that comes up with system table that have 0 grace time. If you guys think that's a worthwhile approach I'll provide tests. Standard test suite obviously runs. I guess it would be easy to extend that approach to slice filters and skinny rows when implementing CASSANDRA-2503 . The only thing that would be needed is a superseding timestamp in the header of a row / CF same as DeletionInfo and probably a configuration option per CF. If too many sstable are read than one could compact that row and put it in memtable. Also with a little (or more) work it might be interesting to see if superseding range information could be stored on block level (row index).
          Hide
          Jonathan Ellis added a comment -

          Daniel, this is great!

          I've rebased and attached a v2 that makes some minor changes. The biggest is to avoid the tombstone collection by avoiding full collateColumns until the end. (This should be more efficient as well.) Is the same problem present with the memtable iterators?

          Show
          Jonathan Ellis added a comment - Daniel, this is great! I've rebased and attached a v2 that makes some minor changes. The biggest is to avoid the tombstone collection by avoiding full collateColumns until the end. (This should be more efficient as well.) Is the same problem present with the memtable iterators?
          Hide
          Stu Hood added a comment - - edited
          • CollationController.collectSSTablesWith*Filter should move to the filter implementations, right?
          • Could maintain the SSTables in DataTracker.View in sorted order according to SSTable.sortNewestDataFirst
          • The comment "Caller is responsible for final removeDeleted" isn't relevant to collectSSTablesWithNameFilter

          Otherwise, this looks great. Thanks Daniel!

          EDIT: actually, we'll need to special case Counters here unfortunately. It would be great to be able to find a way to apply this short-circuiting to Counters in the future though: perhaps by strongly enforcing 2503.

          Show
          Stu Hood added a comment - - edited CollationController.collectSSTablesWith*Filter should move to the filter implementations, right? Could maintain the SSTables in DataTracker.View in sorted order according to SSTable.sortNewestDataFirst The comment "Caller is responsible for final removeDeleted" isn't relevant to collectSSTablesWithNameFilter Otherwise, this looks great. Thanks Daniel! EDIT: actually, we'll need to special case Counters here unfortunately. It would be great to be able to find a way to apply this short-circuiting to Counters in the future though: perhaps by strongly enforcing 2503.
          Hide
          Jonathan Ellis added a comment -

          Getting seriously depressed at the amount of counter-special-casing going on.

          Show
          Jonathan Ellis added a comment - Getting seriously depressed at the amount of counter-special-casing going on.
          Hide
          Ryan King added a comment -

          Not sure what we can do about that unless we make counters idempotent, which may not feasible.

          Show
          Ryan King added a comment - Not sure what we can do about that unless we make counters idempotent, which may not feasible.
          Hide
          Jonathan Ellis added a comment -

          v3 attached.

          CollationController.collectSSTablesWith*Filter should move to the filter implementations, right?

          Feels like one step forward, one step back to me, since that starts to de-encapsulate "iterables." Also, special-casing counters means that we use the "slice" method for name-based counter queries. (Renamed the methods to reflect this.)

          Could maintain the SSTables in DataTracker.View in sorted order according to SSTable.sortNewestDataFirst

          Good idea. Done.

          The comment "Caller is responsible for final removeDeleted" isn't relevant to collectSSTablesWithNameFilter

          Are you sure? It's not the immediate caller anymore, but we're still going back up to CFS.getCF via getTLC, and we still need to do a final removeDeleted there.

          we'll need to special case Counters here

          Done.

          Show
          Jonathan Ellis added a comment - v3 attached. CollationController.collectSSTablesWith*Filter should move to the filter implementations, right? Feels like one step forward, one step back to me, since that starts to de-encapsulate "iterables." Also, special-casing counters means that we use the "slice" method for name-based counter queries. (Renamed the methods to reflect this.) Could maintain the SSTables in DataTracker.View in sorted order according to SSTable.sortNewestDataFirst Good idea. Done. The comment "Caller is responsible for final removeDeleted" isn't relevant to collectSSTablesWithNameFilter Are you sure? It's not the immediate caller anymore, but we're still going back up to CFS.getCF via getTLC, and we still need to do a final removeDeleted there. we'll need to special case Counters here Done.
          Hide
          Daniel Doubleday added a comment -

          avoid the tombstone collection by avoiding full collateColumns until the end.

          Very nice and clean and solves the tombstone problem.

          +1 Looks all good to me

          Show
          Daniel Doubleday added a comment - avoid the tombstone collection by avoiding full collateColumns until the end. Very nice and clean and solves the tombstone problem. +1 Looks all good to me
          Hide
          Jonathan Ellis added a comment -

          Forgot about this part:

          Is the same [premature tombstone supression] problem present with the memtable iterators?

          v3 attached to fix. Also adds a test and cleans up messy use of instance fields to preserve state between methods.

          Show
          Jonathan Ellis added a comment - Forgot about this part: Is the same [premature tombstone supression] problem present with the memtable iterators? v3 attached to fix. Also adds a test and cleans up messy use of instance fields to preserve state between methods.
          Hide
          Jonathan Ellis added a comment -

          Oops, already had v3. renamed to v4.

          Show
          Jonathan Ellis added a comment - Oops, already had v3. renamed to v4.
          Hide
          Daniel Doubleday added a comment -

          -1 on me for not even seeing that v3 wasn't collection memtables

          Show
          Daniel Doubleday added a comment - -1 on me for not even seeing that v3 wasn't collection memtables
          Hide
          Jonathan Ellis added a comment -

          (Is that a +1 on v4 or are you still reviewing?)

          Show
          Jonathan Ellis added a comment - (Is that a +1 on v4 or are you still reviewing?)
          Hide
          Daniel Doubleday added a comment -

          That would be a +1 on v4

          nit: the test doesn't look right to me (can't really check right now cause I'm away form a real keyboard) but

          the last assertEquals tests:

          +        assertEquals(0, cfs.getRecentSSTablesPerReadHistogram()[0]);
          

          but this should surely read

          +        assertEquals(1, cfs.getRecentSSTablesPerReadHistogram()[0]);
          

          I guess you wanted to do one more write without flushing and do the '0' test ...

          Show
          Daniel Doubleday added a comment - That would be a +1 on v4 nit: the test doesn't look right to me (can't really check right now cause I'm away form a real keyboard) but the last assertEquals tests: + assertEquals(0, cfs.getRecentSSTablesPerReadHistogram()[0]); but this should surely read + assertEquals(1, cfs.getRecentSSTablesPerReadHistogram()[0]); I guess you wanted to do one more write without flushing and do the '0' test ...
          Hide
          Jonathan Ellis added a comment -

          You're right. (Screwed that up changing it from an "assert" to "assertEquals.") Will fix on commit.

          Show
          Jonathan Ellis added a comment - You're right. (Screwed that up changing it from an "assert" to "assertEquals.") Will fix on commit.
          Hide
          Hudson added a comment -

          Integrated in Cassandra #1039 (See https://builds.apache.org/job/Cassandra/1039/)
          Stop reading from sstables once we know we have the most recent columns
          patch by Daniel Lundin and jbellis for CASSANDRA-2498

          jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1159942
          Files :

          • /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTable.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/CollationController.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/SuperColumn.java
          • /cassandra/trunk/CHANGES.txt
          • /cassandra/trunk/src/java/org/apache/cassandra/db/Column.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/IColumn.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/db/compaction/CompactionsTest.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/DataTracker.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/columniterator/SSTableSliceIterator.java
          • /cassandra/trunk/test/unit/org/apache/cassandra/db/ColumnFamilyStoreTest.java
          • /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
          Show
          Hudson added a comment - Integrated in Cassandra #1039 (See https://builds.apache.org/job/Cassandra/1039/ ) Stop reading from sstables once we know we have the most recent columns patch by Daniel Lundin and jbellis for CASSANDRA-2498 jbellis : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1159942 Files : /cassandra/trunk/src/java/org/apache/cassandra/io/sstable/SSTable.java /cassandra/trunk/src/java/org/apache/cassandra/db/CollationController.java /cassandra/trunk/src/java/org/apache/cassandra/db/SuperColumn.java /cassandra/trunk/CHANGES.txt /cassandra/trunk/src/java/org/apache/cassandra/db/Column.java /cassandra/trunk/src/java/org/apache/cassandra/db/IColumn.java /cassandra/trunk/test/unit/org/apache/cassandra/db/compaction/CompactionsTest.java /cassandra/trunk/src/java/org/apache/cassandra/db/DataTracker.java /cassandra/trunk/src/java/org/apache/cassandra/db/columniterator/SSTableSliceIterator.java /cassandra/trunk/test/unit/org/apache/cassandra/db/ColumnFamilyStoreTest.java /cassandra/trunk/src/java/org/apache/cassandra/db/ColumnFamilyStore.java

            People

            • Assignee:
              Jonathan Ellis
              Reporter:
              Jonathan Ellis
              Reviewer:
              Jonathan Ellis
            • Votes:
              2 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development