Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 2.0.2
    • Component/s: Core
    • Labels:
      None

      Description

      Keeping a count of reads per-sstable would allow STCS to automatically ignore cold data rather than recompacting it constantly with hot data, dramatically reducing compaction load for typical time series applications and others with time-correlated access patterns. We would not need a separate age-tiered compaction strategy.

      (This will really be useful in conjunction with CASSANDRA-5514.)
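
      For illustration only (no such code is attached to this ticket): a compaction strategy could consult a per-sstable read rate and simply skip sstables that fall below a coldness threshold. The SSTableStats class and COLD_READS_PER_SEC constant below are hypothetical names used for the sketch.

      // Hypothetical sketch: filter cold sstables out of a compaction candidate set
      // based on a recent per-sstable read rate. Not part of any attached patch.
      import java.util.ArrayList;
      import java.util.List;

      class SSTableStats
      {
          final String filename;
          final double readsPerSecond; // e.g. a recent windowed read rate for this sstable

          SSTableStats(String filename, double readsPerSecond)
          {
              this.filename = filename;
              this.readsPerSecond = readsPerSecond;
          }
      }

      class ColdDataFilter
      {
          static final double COLD_READS_PER_SEC = 0.01; // hypothetical coldness threshold

          // Return only the sstables that are read often enough to be worth recompacting.
          static List<SSTableStats> hotOnly(List<SSTableStats> candidates)
          {
              List<SSTableStats> hot = new ArrayList<SSTableStats>();
              for (SSTableStats s : candidates)
                  if (s.readsPerSecond >= COLD_READS_PER_SEC)
                      hot.add(s);
              return hot;
          }
      }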

      1. 0001-Track-row-read-counts-in-SSTR.patch
        4 kB
        Tyler Hobbs
      2. 5515-2.0-v1.txt
        21 kB
        Tyler Hobbs
      3. 5515-2.0-v2.txt
        21 kB
        Tyler Hobbs

        Issue Links

          Activity

          Tyler Hobbs added a comment -

          Should this ticket include the STCS work, or just the read-count tracking? If it includes the STCS work, I think there's a lot of room for discussion about how STCS should utilize that data.

          Jonathan Ellis added a comment - edited

          Just tracking the read counts for now; then we can build on top of it. That follow-up work might be out of scope for 2.0.x, but I don't expect that adding the tracking itself will be very invasive. Still, just having the information available will be useful.

          Tyler Hobbs added a comment -

          At first I was not planning to include range slice queries in the count, but on second thought, those may still be useful. What do you think?

          Jonathan Ellis added a comment - edited

          Hmm. I think either leave it out or track it separately. Those tend to be M/R-type queries (ones that touch the entire dataset, or close to it) rather than an indicator of hotness.

          Edit: I'd vote for leaving it out; we can always add it later if we come up with a use case.

          Tyler Hobbs added a comment -

          Sounds good.

          Patch 0001 adds a counter to SSTR and increments it in the two paths that seem to be directly involved in SSTable reads for row queries. (The changes are small enough that I feel like I may have really missed something.)
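
          For reference, a counter of this kind can be as simple as an atomic long on the reader, roughly like the sketch below (the field and method names here are illustrative, not necessarily those in patch 0001).

          // Illustrative only: a per-sstable read counter in the spirit of patch 0001.
          import java.util.concurrent.atomic.AtomicLong;

          class SSTableReaderSketch // stand-in for SSTableReader (SSTR)
          {
              private final AtomicLong readCount = new AtomicLong();

              // Invoked from the read paths that serve row queries against this sstable.
              void incrementReadCount()
              {
                  readCount.incrementAndGet();
              }

              long getReadCount()
              {
                  return readCount.get();
              }
          }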

          Jonathan Ellis added a comment -

          I think we do need to persist this one across restarts, e.g., in a system table. (Maybe just a Map would be fine, since pulling 1000s of entries into memory once on startup to initialize state should be okay.)

          Tyler Hobbs added a comment -

          I think we do need to persist this one across restarts, e.g., in a system table. (Maybe just a Map would be fine, since pulling 1000s of entries into memory once on startup to initialize state should be okay.)

          I take it we will want to periodically sync the system table with the latest counts? (Just relying on some kind of shutdown hook seems like a bad idea.) Given that there can be tens of thousands of sstables, I think we'll want to throttle that sync.

          By "a Map", I assume you mean a map column collection. I am a little concerned about pulling in a map with tens of thousands of entries all at once, even if it's just on startup.

          As far as clearing entries goes, directly deleting when the sstable is removed, combined with a TTL of 10 days to handle odd cases, seems reasonable to me.

          Thoughts?

          Jonathan Ellis added a comment -

          Either time-based throttle or sync-on-compaction seems reasonable to me. Leaning towards the former since write:read ratio may be low.

          If we're not going to take the simple approach with a map, should we keep more data like this?

          CREATE TABLE sstable_activity (
              keyspace text,
              columnfamily text,
              generation int,
              hour_ending_at timestamp,
              reads int,
              PRIMARY KEY ((keyspace, columnfamily, generation), hour_ending_at)
          )
          

          Delete-on-removal + TTL sounds right.

          Tyler Hobbs added a comment -

          If we're not going to take the simple approach with a map, should we keep more data like this?

          That's what I was thinking with the exception of the hour_ending_at column; I was thinking we would periodically overwrite a single read count row per sstable instead of tracking it in time-series fashion. Are you specifically looking to have both "recent" read rates and total historic read rates? If so, just using two counters would be lighter weight. I don't foresee compaction strategies using more than the recent and total rates, but I suppose users might find full time series data useful.

          Jonathan Ellis added a comment -

          Maybe get fancy and do a reservoir-sampling histogram? Metrics makes that easy.
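
          For context, the Metrics (3.x) API being referred to looks roughly like the sketch below; the per-sstable wiring is hypothetical and not taken from any attached patch.

          // Sketch of a reservoir-sampling histogram using the Metrics library.
          // The exponentially decaying reservoir biases the sample towards recent data.
          import com.codahale.metrics.ExponentiallyDecayingReservoir;
          import com.codahale.metrics.Histogram;
          import com.codahale.metrics.Snapshot;

          class ReadSampleSketch
          {
              private final Histogram readSamples = new Histogram(new ExponentiallyDecayingReservoir());

              void onRead(long value)
              {
                  readSamples.update(value);
              }

              Snapshot snapshot()
              {
                  // The snapshot exposes quantiles (median, 99th percentile, etc.), but as
                  // noted below there is no built-in way to serialize and restore the reservoir.
                  return readSamples.getSnapshot();
              }
          }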

          Tyler Hobbs added a comment -

          Maybe get fancy and do a reservoir-sampling histogram? Metrics makes that easy.

          I could see that being useful during steady state, but there doesn't appear to be an easy way to serialize and rebuild those, which either means we would need a somewhat custom implementation, or we wouldn't have reliable information on startup. (The custom implementation could support dump and restore, or perhaps just expose a notion of "confidence".)

          Perhaps we should spend some time looking at the use cases before investing too much effort in this?

          Jonathan Ellis added a comment -

          Fair enough. Initially I only really care about supporting CASSANDRA-5519 and compaction making better decisions.

          T Jake Luciani added a comment -

          Before I forget, this is the logic for leveldb's hotness threshold for compaction

          https://code.google.com/p/leveldb/source/browse/db/version_set.cc#606

          Jonathan Ellis added a comment -

          Interesting that Hyperdex said they do better without that: http://hackingdistributed.com/2013/06/17/hyperleveldb/

          Tyler Hobbs added a comment -

          With that in mind, I'm leaning towards using Metrics' Meters, which have pretty low overhead and would be easy to dump and restore. The only flaw is that they are hard-coded with 1m, 5m, and 15m windows; for compaction decisions, I feel like it would be nice to have a multi-hour window, and the 1m and 5m windows are probably too short to be useful for CASSANDRA-5519 (if we do something similar to my suggestion on that ticket). We could of course subclass and hardcode windows of our choosing.
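
          A custom window boils down to choosing a different smoothing constant for the EWMA. A rough, self-contained sketch of the idea (the class name and the 5-second tick are assumptions, not taken from Metrics or from any patch here):

          // Sketch: an exponentially weighted moving average over an arbitrary window
          // (e.g. two hours), in the same spirit as the Metrics EWMA but not limited
          // to its hard-coded 1m/5m/15m windows.
          import java.util.concurrent.atomic.AtomicLong;

          class WindowedEWMA
          {
              private static final double TICK_INTERVAL_SECONDS = 5.0; // assumed tick period

              private final double alpha; // smoothing factor derived from the window length
              private final AtomicLong uncounted = new AtomicLong();
              private volatile double rate = 0.0; // events per second
              private volatile boolean initialized = false;

              WindowedEWMA(double windowSeconds)
              {
                  this.alpha = 1.0 - Math.exp(-TICK_INTERVAL_SECONDS / windowSeconds);
              }

              void mark()
              {
                  uncounted.incrementAndGet();
              }

              // Expected to be called every TICK_INTERVAL_SECONDS by a scheduler.
              synchronized void tick()
              {
                  double instantRate = uncounted.getAndSet(0) / TICK_INTERVAL_SECONDS;
                  if (initialized)
                  {
                      rate += alpha * (instantRate - rate);
                  }
                  else
                  {
                      rate = instantRate;
                      initialized = true;
                  }
              }

              double rate()
              {
                  return rate;
              }
          }

          A two-hour window would be new WindowedEWMA(2 * 60 * 60); a 15-minute window, new WindowedEWMA(15 * 60).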

          T Jake Luciani added a comment -

          I'm not sure you need to track reads across restarts; is there a compelling reason to do this?

          Tyler Hobbs added a comment -

          I'm not sure you need to track reads across restarts; is there a compelling reason to do this?

          For 5519, it would be useful for determining appropriate initial sizes for the in-memory index summary. For compaction strategies, it would allow us to have more informed compactions immediately after startup.

          T Jake Luciani added a comment -

          That makes sense.

          Tyler Hobbs added a comment -

          Attached patch 5515-2.0-v1.txt (and branch) adds a custom Meter-like class and tracks 15m and 2h average read rates for sstables, persisting them in a new table, system.sstable_activity.

          Unfortunately, I had to make an entirely new class for the meter and EWMA due to various difficulties in subclassing/reusing those directly from Metrics. I based the classes on the 3.0.1 Meter and EWMA classes, although I simplified them somewhat. (I tried to push some changes upstream to make this easier in the future, but it looks like it won't make much of a difference.)

          My main uncertainty with this patch is where to hook in to delete rows from system.sstable_activity when an sstable is deleted, so any suggestions there would be appreciated.

          Also, I wasn't sure how configurable we wanted to make this in terms of the tick interval (5 seconds), the persistence interval (5 minutes), and throttling (100 writes/sec), so all of those are hard-coded for now. I think those values are reasonable, but they're hard to guess at.
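
          To make the scheme concrete, the persistence loop could look something like the sketch below; the class and method names are placeholders, and Guava's RateLimiter is assumed for the 100 writes/sec throttle (the actual patch may do this differently).

          // Hypothetical sketch of periodic, throttled persistence of per-sstable read rates.
          import com.google.common.util.concurrent.RateLimiter;

          import java.util.Map;
          import java.util.concurrent.Executors;
          import java.util.concurrent.ScheduledExecutorService;
          import java.util.concurrent.TimeUnit;

          class MeterPersister
          {
              private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
              private final RateLimiter writeThrottle = RateLimiter.create(100.0); // 100 writes/sec

              // sstable generation -> {15m rate, 2h rate} snapshot to be written out
              private final Map<Integer, double[]> readRates;

              MeterPersister(Map<Integer, double[]> readRates)
              {
                  this.readRates = readRates;
              }

              void start()
              {
                  // Persist every 5 minutes, throttled so tens of thousands of sstables
                  // don't produce a burst of writes to the system table.
                  scheduler.scheduleAtFixedRate(new Runnable()
                  {
                      public void run()
                      {
                          persistAll();
                      }
                  }, 5, 5, TimeUnit.MINUTES);
              }

              private void persistAll()
              {
                  for (Map.Entry<Integer, double[]> entry : readRates.entrySet())
                  {
                      writeThrottle.acquire(); // blocks as needed to enforce the cap
                      writeRow(entry.getKey(), entry.getValue());
                  }
              }

              private void writeRow(int generation, double[] rates)
              {
                  // In the real patch this would be a write to system.sstable_activity.
                  // Omitted here; this sketch only shows the scheduling and throttling shape.
              }
          }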

          Jeremiah Jordan added a comment -

          Tyler Hobbs does CASSANDRA-6010 do what you need?

          Tyler Hobbs added a comment -

          Tyler Hobbs does CASSANDRA-6010 do what you need?

          Actually, I made use of that in this patch, although (awkwardly) DataTracker subscribes to its own notifications.

          Jeremiah Jordan added a comment -

          K, just read your question, hadn't read the patch yet.

          Jonathan Ellis added a comment -
          • Could we put the clearSSTableReadMeter in CFS or SSTR? Or put it in SSTableDeletingTask instead of making it a notification at all. Alternatively, since it's just one row, I'd lean towards just letting TTL take care of it.
          • Does TTL need to be so long if we're persisting every 5m?
          • It looks like having the increments in the read section of the iterators means we only increment if the index lookup (getPosition) is successful. IMO we should increment before getPosition. It may be cleaner to do this in CollationController, but the iterator constructor also works.
          • Use Keyspace.SYSTEM_KS instead of hardcoding "system"
          • What can we do to restore coldness data from a snapshot?
          Tyler Hobbs added a comment -

          Could we put the clearSSTableReadMeter in CFS or SSTR? Or put it in SSTableDeletingTask instead of making it a notification at all. Alternatively, since it's just one row, I'd lean towards just letting TTL take care of it.

          Going with SSTR would result in duplicate notifications. I think SSTableDeletingTask is a good spot; I just wasn't sure if that would be appropriate. If the TTL were short (say, less than 1 day), I think that would be okay, but...

          Does TTL need to be so long if we're persisting every 5m?

          I considered making it smaller, but in the case where a node goes down for some amount of time, it would be nice to not lose all of the stats when it comes back up.

          It looks like having the increments in the read section of the iterators means we only increment if the index lookup (getPosition) is successful. IMO we should increment before getPosition. It may be cleaner to do this in CollationController, but the iterator constructor also works.

          Just to check: a getPosition() failure indicates a BF false positive, but we want to include those in the count? I can see this making sense for managing in-memory index summary sizes, but maybe not for STCS optimization. (It should be a small enough number not to matter much either way.)

          What can we do to restore coldness data from a snapshot?

          Not much with the current storage strategy. Do we have any workarounds/suggestions for restoring TTLed data in general?

          Jonathan Ellis added a comment -

          I can see this making sense for managing in-memory index summary sizes, but maybe not for STCS optimization.

          I actually think it makes sense for both – if there's a hot partition that BF isn't rejecting, then we should go ahead and compact it with other hot sstables even if it's being caused by a FP. (Also, note that it's a bit more complex than just BF FP – we can reject a slice from a partition that does exist, from cell index data in IndexedBlockFetcher.)

          I guess if you want, we could increment once for an index read and again for a data read. But I'm not sure that actually buys us anything useful.

          Do we have any workarounds/suggestions for restoring TTLed data in general?

          No. I was thinking more along the lines of writing out a summary file, or copying it into a snapshot metadata system table.

          Tyler Hobbs added a comment - edited

          5515-2.0-v2.txt moves the row clearing to SSTableDeletingTask, increments reads in CollationController, and uses Keyspace.SYSTEM_KS.

          With regard to restoring from snapshots, I think the actual restore procedure might be somewhat complicated and of relatively little benefit compared to the current patch. Mind if we move that to another ticket?

          Jonathan Ellis added a comment -

          Committed, with a small tweak: the deletion logic keys off of whether readMeter == null, so only the constructor needs to care about what that logic is.

          Mind if we move that to another ticket?

          All right. YAGNI.


            People

            • Assignee:
              Tyler Hobbs
              Reporter:
              Jonathan Ellis
              Reviewer:
              Jonathan Ellis
            • Votes:
              0
              Watchers:
              10

              Dates

              • Created:
                Updated:
                Resolved:
