CASSANDRA-1625: make the row cache continuously durable

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Core
    • Labels: None

      Description

      I was looking into how the row cache works today and realized that only the row keys are saved, and that the cache is later pre-populated from them on start-up.

      On the premise that row caches are typically used for small rows, of which there may be many, pre-population is highly likely to be seek bound on large data sets.

      The pre-population could be made faster by increasing I/O queue depth (by concurrency or by libaio as in 1576), but especially on large data sets the performance would be nowhere near what could be achieved if a reasonably sized file containing the actual rows were read sequentially on start-up.

      On the one hand, Cassandra's design means this should be much easier to do efficiently than in some other cases, but on the other hand it is still not entirely trivial.

      The key problem with maintaining a continuously durable cache is that one must never read stale data on start-up. Stale could mean either data that was later deleted, or an old version of data that was updated.

      In the case of Cassandra, this means that any cache restored on start-up must be up-to-date with whatever position in the commit log that commit log recovery will start at. (Because the row cache is for an entire row, we can't couple updating of an on-disk row cache with memtable flushes.)
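
      As a small illustration of that constraint (with purely hypothetical names; Cassandra's actual commit log bookkeeping differs), a saved cache would have to carry the commit log position it is consistent with, and could only be restored if recovery starts at or before that point:

      // Hypothetical sketch: a saved row cache tagged with the commit log
      // position it is consistent with. It may only be restored if commit
      // log replay will start at or before that position, so that replay
      // brings the restored rows fully up to date.
      class SavedRowCacheHeader {
          final long segmentId;         // commit log segment the snapshot is synced to
          final int positionInSegment;  // byte offset within that segment

          SavedRowCacheHeader(long segmentId, int positionInSegment) {
              this.segmentId = segmentId;
              this.positionInSegment = positionInSegment;
          }

          /** True if replay starting at (replaySegmentId, replayPosition) covers
              everything the snapshot might be missing. */
          boolean usableWithReplayStart(long replaySegmentId, int replayPosition) {
              return replaySegmentId < segmentId
                      || (replaySegmentId == segmentId && replayPosition <= positionInSegment);
          }
      }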

      I can see two main approaches:

      (a) Periodically dump the entire row cache, deferring commit log eviction in synchronization with said dumping.

      (b) Keep a change log of sorts, similar to the commit log but filtered to contain only data written to the commit log that affects keys that were in the row cache at the time. Evicting commit log segments, or updating positional markers that affect where commit log recovery starts, would imply fsync()ing this change log. An incremental traversal, or alternatively a periodic full dump, would be needed so that old row change log segments can be evicted without loss of cache warmness.

      I like (b), but it is also the introduction of significant complexity (and potential write path overhead) for the purpose of the row cache. In the worst case where hotly read data is also hotly written, the overhead could be particularly significant.
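
      To make (b) concrete, here is a minimal, hypothetical sketch of what such a filtered change log could look like; the names (RowCacheChangeLog, maybeAppend, the cachedKeys set) are illustrative only and do not correspond to actual Cassandra internals:

      // Hypothetical sketch of approach (b): a change log restricted to keys
      // currently resident in the row cache.
      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.channels.FileChannel;
      import java.util.Set;

      class RowCacheChangeLog {
          private final FileChannel channel;

          RowCacheChangeLog(String path) throws IOException {
              this.channel = new FileOutputStream(path, true).getChannel();
          }

          /** Called from the write path after the mutation has been appended to the commit log. */
          void maybeAppend(ByteBuffer key, ByteBuffer serializedMutation, Set<ByteBuffer> cachedKeys) throws IOException {
              if (!cachedKeys.contains(key))
                  return; // only writes touching cached rows are logged
              ByteBuffer lengths = ByteBuffer.allocate(8);
              lengths.putInt(key.remaining()).putInt(serializedMutation.remaining()).flip();
              channel.write(new ByteBuffer[] { lengths, key.duplicate(), serializedMutation.duplicate() });
          }

          /** Must be called before a commit log segment is evicted (or before the recovery
              start position is advanced), so the restored cache can never miss updates that
              recovery will not replay. */
          void sync() throws IOException {
              channel.force(true);
          }
      }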

      I am not convinced whether this is a good idea for Cassandra, but I have a use-case where a similar cache might have to be written in the application to achieve the desired effect (pre-population being too slow for a sufficiently large row cache). But there are reasons why, in an ideal world, having such a continuously durable cache in Cassandra would be much better than something at the application level. The primary reason is that it does not interact poorly with consistency in the cluster, since the cache is node-local and appropriate measures would be taken to make it consistent locally on each node. I.e., it would be entirely transparent to the application.

      Thoughts? Like/dislike/too complex/not worth it?

          Activity

          Stu Hood added a comment -

          While I like the idea of optimizing disk IO needed to load recently read entries, "continuously durable cache" is an oxymoron, imo. Rather than a cache, it becomes a "top n view of base data ordered by recency", which is equally useful, but might make us approach the problem from a different angle.

          For instance, an alternative way to implement (B) would be to maintain an on-disk version of the in-memory cache in a ColumnFamily. The commitlog already implements buffering the random writes you would need: every time you write to the rowcache, you synchronously write the diff to the CF, which handles all of the buffering and collection for you. Preloading the row cache then becomes a table scan of that CF, which (except for compaction concerns) contains consecutive hot/cached data.
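
          A rough sketch of that idea, purely for illustration (the CacheStore interface below is a hypothetical stand-in for the cache ColumnFamily, not the real ColumnFamilyStore API):

          // Illustrative sketch of the "cache as a ColumnFamily" idea: every
          // row-cache update is mirrored into a persistence layer that buffers
          // the random writes (as the commit log would), and start-up preload
          // becomes a sequential scan instead of per-key seeks.
          import java.nio.ByteBuffer;
          import java.util.Map;
          import java.util.concurrent.ConcurrentHashMap;

          class RowCacheBackedByCF {
              interface CacheStore {                                   // hypothetical persistence layer
                  void apply(ByteBuffer key, ByteBuffer rowDiff);      // goes through the normal write path
                  void delete(ByteBuffer key);                         // eviction / invalidation
                  Iterable<Map.Entry<ByteBuffer, ByteBuffer>> scan();  // sequential read of the cache CF
              }

              private final Map<ByteBuffer, ByteBuffer> inMemory = new ConcurrentHashMap<>();
              private final CacheStore store;

              RowCacheBackedByCF(CacheStore store) {
                  this.store = store;
              }

              /** On every row-cache update, synchronously mirror the diff into the cache CF. */
              void put(ByteBuffer key, ByteBuffer row, ByteBuffer diff) {
                  inMemory.put(key, row);
                  store.apply(key, diff);
              }

              /** Start-up: one sequential scan of the cache CF instead of per-key seeks. */
              void preload() {
                  for (Map.Entry<ByteBuffer, ByteBuffer> e : store.scan())
                      inMemory.put(e.getKey(), e.getValue());
              }
          }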

          But, this is a very similar effect to what we would get from 1608: if we are optimizing the on-disk layout for hot data, then cache-key snapshots can be loaded more quickly/consecutively. I think finding a solution to 1608 may make the random-access concerns here less pressing.

          Peter Schuller added a comment -

          Interesting, that does sound far less invasive than what I proposed, although some measure needs to be taken to ensure stale data (with respect to concurrent writing) is never written to that CF.

          In general, by the way, an important observation is that while both my original proposal and yours imply that reads can result in writes, the more effective the cache, the fewer writes are required (assuming one does not persist relative recency in the LRU, one only needs to write on cache eviction, to remove the old entry and add the new one). Presumably, if the cache is so important that one wants to ensure it is persistent on disk, the cache hit ratio is likely to be high, which means fewer writes per read.

          In other words, the write overhead of the cache decreases the more efficient the cache is.
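
          A quick back-of-the-envelope illustration of that point, under the same assumption as above (recency order is not persisted, so only a miss causes one insert plus, once the cache is full, one eviction):

          // Rough model only: persistent-cache writes per read as a function of hit ratio.
          class CacheWriteOverhead {
              public static void main(String[] args) {
                  double[] hitRatios = { 0.50, 0.90, 0.99 };
                  for (double h : hitRatios) {
                      double writesPerRead = 2 * (1 - h); // one insert + one eviction per miss
                      System.out.printf("hit ratio %.2f -> ~%.2f cache writes per read%n", h, writesPerRead);
                  }
              }
          }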

          With respect to 1608 - my interpretation of that one is probably a bit weaker, in the sense that I look at it as better prioritization of when compaction happens, as well as actively triggering compaction solely based on statistics from reads. While I can see how this will slowly, over time, tend to result in frequently accessed data being clustered in terms of sstables, I don't quite see how one would expect it to work so well as to make the pre-population not seek bound. If, however, 1608 implies keeping row-level statistics such that compactions could be triggered that separate hot data from cold data, I can see it making this obsolete.

          I guess such a discussion is better kept in 1608.

          Peter Schuller added a comment - edited

          Synergy!

          I just realized that if we combine Stu's suggestion with:

          (1) CASSANDRA-1657 (in-memory cf)
          (2) Just have the CF be the "cache" (I agree with Stu's problem with calling it a cache) rather than being used only on start-up to populate the row cache
          (3) Ongoing improvements to compaction (whether it's fadvise/mincore or others) that eliminate the need to keep both the old and the new sstables in memory

          We now have, at least when it comes to eliminating disk I/O, an in-memory cache that is:

          (1) Memory efficient due to being backed by tightly packed sstables as usual (subject to compaction thresholds).
          (2) Not made up of objects in the heap, so no GC pressure trade-off (except for tracking recency information).
          (3) Under much more explicit control in terms of size; while it would not come for free, being able to say "have the cache be N MB large" would not be unrealistic.
          (4) Instantly available, all hot and cozy, on start-up regardless of total data set size; no seek bound operations in sight. Theoretically as efficient as an mmap() with MAP_POPULATE (edited: I accidentally put mlock()), depending on CASSANDRA-1657.

          I say "at least when it comes to eliminating disk I/O" because presumably reading from such a CF is more expensive than using the currently existing row cache.

          The two approaches are not mutually exclusive though and could even be used together for the same CF if a situation should so demand.

          The lack of (3) means the memory efficiency is lessened, but in other respects the positive effects should remain.

          But wait, there is more, as they say: If we ignore CASSANDRA-1657 for a moment, and instead replace it by some logic to keep a given CF on a different path/mountpoint, Cassandra would now have the capability of keeping the hot set persistent on faster storage. In other words, this is direct support for utilizing a fast SSD in a system which otherwise has lots of bulk storage - without relying on something OS specific like ZFS L2ARC, and with, it seems, minimal (for the gains) implementation effort.

          This has me all excited, so I hope no one points out some obvious flaw in this plan.

          Stu Hood added a comment -

          Was scrolling through old tickets and came across this one.

          With the pluggable cache support we have now, I think someone could create an implementation that used this strategy: the in-process representation of the cache would be a CFStore of only the hot data, and an LRU list of keys. Members of the LRU list would be stored in the CFStore, so a key falling off the end of the LRU would cause a write to delete that entry from the on-disk representation.
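
          For what it is worth, a bare-bones sketch of that shape, with hypothetical names (HotDataStore standing in for the on-disk cache CFStore, and the pluggable-cache wiring omitted):

          // Sketch only: an access-ordered LRU of keys whose values live in an
          // on-disk store; a key falling off the end of the LRU triggers a
          // delete against the on-disk representation.
          import java.nio.ByteBuffer;
          import java.util.LinkedHashMap;
          import java.util.Map;

          class LruKeyTracker {
              interface HotDataStore {                 // hypothetical stand-in for the cache CFStore
                  ByteBuffer read(ByteBuffer key);
                  void write(ByteBuffer key, ByteBuffer row);
                  void delete(ByteBuffer key);         // issued when a key falls off the LRU
              }

              private final HotDataStore store;
              private final LinkedHashMap<ByteBuffer, Boolean> lruKeys;

              LruKeyTracker(HotDataStore store, int capacity) {
                  this.store = store;
                  // access-ordered map: iteration order runs from least- to most-recently used
                  this.lruKeys = new LinkedHashMap<ByteBuffer, Boolean>(16, 0.75f, true) {
                      @Override
                      protected boolean removeEldestEntry(Map.Entry<ByteBuffer, Boolean> eldest) {
                          if (size() > capacity) {
                              store.delete(eldest.getKey()); // keep the on-disk view in step with eviction
                              return true;
                          }
                          return false;
                      }
                  };
              }

              synchronized void recordWrite(ByteBuffer key, ByteBuffer row) {
                  store.write(key, row);
                  lruKeys.put(key, Boolean.TRUE);
              }

              synchronized ByteBuffer get(ByteBuffer key) {
                  if (lruKeys.get(key) == null)        // a hit also refreshes recency
                      return null;
                  return store.read(key);
              }
          }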

          Then again, in a post-1608 world, who knows if this is necessary?

          Jonathan Ellis added a comment -

          Storing the full cache tuple (and evicting entries that turn out to be obsolete lazily) was done in CASSANDRA-1625


            People

            • Assignee: Unassigned
            • Reporter: Peter Schuller
            • Votes: 3
            • Watchers: 5
