CASSANDRA-10855

Use Caffeine (W-TinyLFU) for on-heap caches

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 4.0
    • Component/s: None
    • Labels:

      Description

      Cassandra currently uses ConcurrentLinkedHashMap for performance-critical caches (key, counter) and Guava's cache for non-critical ones (auth, metrics, security). All of these usages have been replaced by Caffeine, written by the author of both of those libraries.

      The primary incentive is to switch from the LRU policy to W-TinyLFU, which provides near-optimal hit rates. It performs particularly well on database and search traces, is scan resistant, and adds only a small time/space overhead over LRU.

      Secondarily, Guava's cache never matched CLHM's performance because some optimizations were not ported over. This change results in faster reads and avoids creating garbage as a side effect.
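
      For a sense of what the replacement looks like at a call site, here is a minimal sketch of constructing a bounded Caffeine cache; the cache name, capacity, and weigher below are illustrative, not Cassandra's actual configuration:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class CaffeineCacheSketch {
    public static void main(String[] args) {
        // A bounded cache; eviction decisions use W-TinyLFU rather than LRU.
        Cache<String, byte[]> keyCache = Caffeine.newBuilder()
            .maximumWeight(64 * 1024 * 1024)           // cap by total weight (bytes here)
            .weigher((String k, byte[] v) -> v.length) // weigh entries by value size
            .build();

        keyCache.put("sstable-1:key-42", new byte[128]);
        byte[] hit = keyCache.getIfPresent("sstable-1:key-42");
        System.out.println(hit != null ? "hit" : "miss");
    }
}
```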

      1. CASSANDRA-10855.patch
        1.98 MB
        Ben Manes
      2. CASSANDRA-10855.patch
        1.02 MB
        Ben Manes


          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user ben-manes opened a pull request:

          https://github.com/apache/cassandra/pull/59

          CASSANDRA-10855: Use Caffeine for on-heap caches

          [Caffeine](https://github.com/ben-manes/caffeine) is a Java 8 cache by the author of ConcurrentLinkedHashMap and Guava's cache, which is what was previously used. For the key and counter caches, the CLHM-based cache remains as a fallback if needed, but is deprecated and scheduled for removal.

           CLHM and Guava use the LRU policy due to its simplicity, decent hit rate, and known characteristics. Caffeine uses a near-optimal policy, [W-TinyLFU](http://arxiv.org/pdf/1512.00727.pdf), which has a significantly [higher hit rate](https://github.com/ben-manes/caffeine/wiki/Efficiency). In particular, the hit rates in the paper show a substantial gain on database and search workloads.

           Performance between the CLHM and Caffeine caches should be similar, with some possible gains in write throughput. Significant gains may be observed over Guava, which never ported over some optimizations that improve read throughput and avoid creating garbage as a side effect.

           For a brief overview, see this [article](https://www.voxxed.com/blog/2015/12/add-a-boost-of-caffeine-to-your-java). This pull request is being tracked in [CASSANDRA-10855](https://issues.apache.org/jira/browse/CASSANDRA-10855).

          @jbellis @snazy

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/ben-manes/cassandra caffeine

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/cassandra/pull/59.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #59


          commit 3435a5a2b6eee30210350d5dfe1f76203d9d0f35
          Author: Ben Manes <ben.manes@gmail.com>
          Date: 2015-12-14T06:22:04Z

          Use Caffeine for key and counter on-heap caches

          Caffeine is a Java 8 cache by the author of ConcurrentLinkedHashMap,
          which is what was previously used. That cache remains as a fallback
          if needed, but is deprecated and scheduled for removal.

          CLHM uses the LRU policy due to its simplicity, decent hit rate, and
          known characteristics. Caffeine uses a near optimal policy, W-TinyLFU,
          which has a significantly higher hit rate. In particular the hit rate
          in the paper shows a substantial gain on database and search workloads.

          The performance between the two caches should be similar, with some
          possible gains in write throughput.

          Caffeine: https://github.com/ben-manes/caffeine
          W-TinyLFU: http://arxiv.org/pdf/1512.00727.pdf
          Overview: https://www.voxxed.com/blog/2015/12/add-a-boost-of-caffeine-to-your-java

          commit 91ac1fbf6e38704597befce9c280c516d61fb07a
          Author: Ben Manes <ben.manes@gmail.com>
          Date: 2015-12-14T06:44:17Z

          Switch usage of Guava caches to Caffeine


          ben.manes Ben Manes added a comment -

          See the pull request for this task.

          aweisberg Ariel Weisberg added a comment -

           I don't object to swapping CLHM for a new implementation. As a matter of principle we should measure what impact this has so we can understand the yield of the optimizations in TinyLFU. We shouldn't make performance changes without measuring, and measuring also lets us inform the evolution of OHC, which solves the GC pressure issue.

          I see on heap caches, memtables, and any other source of high volume survivor copying and promotion as a dead end.

           We changed the format of the key cache for 3.0 so that it would be suitable for memory mapping, but didn't get the change in to actually use a memory-mapped access method. Robert Stupp did write a lot of the code, but it was still using OHC as the cache. I suspect that will get us 80% of the way there, but someone has to pick that up. Joshua McKenzie, we should think about whether we want to schedule that work if we want to boost read performance and tail latencies via reduced heap pressure.

          ben.manes Ben Manes added a comment -

          A little background from private discussions with Jonathan and Robert leading to this ticket. If the performance analysis is positive then it provides a strong motivation to integrate W-TinyLFU into OHC. As Robert described it, the on-heap caches are low hanging fruit where the key cache is performance critical.

           These caching projects have always been on my personal time as a research hobby. My original interest was concurrency, since all other caches either use coarse locking or sacrifice quality for throughput (e.g. lower hit rates, O(n) eviction). More recently I worked on improving hit rates, as described in that paper. I've tended to keep a narrow focus until a particularly hard problem is solved, and off-heap adds more complexity, so it wasn't tackled. I'm a little disappointed that OHC didn't borrow ideas from CLHM, as the fundamentals are transferable, but perhaps we'll combine them all together in a future project.

          snazy Robert Stupp added a comment -

          Ran CI against (rebased) branch and the results look good to me (i.e. no regression).

          Unfortunately the cstar perf runs (trades-fwd-lcs-nolz4 and cassci regression test r/w) show that using Caffeine for the key cache slightly degrades performance in terms of throughput and latencies. Some percentiles (mostly max latencies) are slightly better, but the overall result is that performance degrades. The key-cache hit rate is slightly better with Caffeine (trades-fwd-lcs-nolz4 showing slightly more than 10% hit rate w/ Caffeine vs. slightly less than 10% w/o Caffeine).

          trades-fwd-lcs-nolz4 uses somewhat bigger partitions and fills the key cache completely.
          regression r/w uses small partitions and just uses roughly 10% of the key cache.
           perf runs used a 3-node C* cluster (“blade_11_b”) - each node has two 6-core Xeon CPUs and a total of 64 GB RAM.

           From a quick look at the Caffeine source, I suspect that the worse numbers are caused by the spinning loops. Also, padded fields, which can behave completely differently on NUMA systems than on single-CPU systems, may have some bad influence in this test.

          ben.manes Ben Manes added a comment -

           Do you know why the hit rate is so poor? It being low regardless of policy is troubling, so it seems important to figure out why that is and what (if anything) can be done to improve it.

           An ineffective cache impacts the design assumptions by making writes (not reads) the common case. The slightly heavier policy, default delegation to FJP, etc. would all be more noticeable when writes are the dominant behavior. I'm not sure off-hand why it's worse, but that seems secondary to fixing the hit rate problem.

          aweisberg Ariel Weisberg added a comment -

          Is this testing with a uniform access distribution? We keep doing that because it's the default and I think it's a really bad way to measure as a default since it's the exception not the common case.

          ben.manes Ben Manes added a comment -

          Happy new year. Anything I can do to help keep this moving?

          Ariel's comment explains the poor hit rate, as a uniform distribution will result in a fixed and low hit rate regardless of policy. An effective cache is often at around 85%, ideally in the high 90s to make reads the dominant case, but even 65% is useful. Even when the hit rate is maxed out, the effect of a better policy can be noticeable. In that case it reduces the TCO by being able to achieve the same performance with smaller, cheaper machines.

           Glancing at the uniform results, the degradation is small enough to probably be within the margin of error, where the run and other system effects dominate. In an update-heavy workload the new cache should be faster, since synchronization carries less penalty than CAS storms. But on the perf test's insertion-heavy workload it is probably a little slower due to its features incurring more complexity. Another set of eyes might uncover some improvements, so that's always welcome.

          Zipf-like distributions are considered the most common workload patterns. Ideally we could capture a production trace and simulate it, as the database trace I use shows very promising results.
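
           To make the distinction concrete, here is a toy, self-contained simulation (arbitrary parameters, with a plain LinkedHashMap standing in for an LRU cache). It shows that a uniform access distribution pins the hit rate near capacity/keyspace no matter the policy, while a Zipf-like workload rewards caching:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public final class HitRateSketch {
    // Simple LRU cache backed by an access-ordered LinkedHashMap.
    static final class Lru extends LinkedHashMap<Integer, Integer> {
        final int capacity;
        Lru(int capacity) { super(16, 0.75f, true); this.capacity = capacity; }
        @Override protected boolean removeEldestEntry(Map.Entry<Integer, Integer> e) {
            return size() > capacity;
        }
    }

    public static void main(String[] args) {
        int keys = 100_000, capacity = 10_000, requests = 1_000_000;
        Random rnd = new Random(42);

        // Zipf(s = 0.99): precompute the cumulative distribution once.
        double[] cdf = new double[keys];
        double sum = 0;
        for (int i = 0; i < keys; i++) { sum += 1.0 / Math.pow(i + 1, 0.99); cdf[i] = sum; }

        for (boolean zipf : new boolean[] { false, true }) {
            Lru cache = new Lru(capacity);
            int hits = 0;
            for (int r = 0; r < requests; r++) {
                int key;
                if (zipf) {
                    // Inverse-CDF sampling via binary search.
                    double u = rnd.nextDouble() * sum;
                    int lo = 0, hi = keys - 1;
                    while (lo < hi) { int mid = (lo + hi) >>> 1; if (cdf[mid] < u) lo = mid + 1; else hi = mid; }
                    key = lo;
                } else {
                    key = rnd.nextInt(keys);
                }
                if (cache.get(key) != null) hits++; else cache.put(key, key);
            }
            // Uniform stays near capacity/keys (~10%); Zipf lands far higher.
            System.out.printf("%s: hit rate %.1f%%%n", zipf ? "zipf" : "uniform", 100.0 * hits / requests);
        }
    }
}
```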

          snazy Robert Stupp added a comment -

          It would help if we would have some more workloads as cassandra-stress profiles. Could you help with that?

           BTW, I have to partly correct my last statement: trades-fwd-lcs-nolz4 has the 99.x% key-cache hit ratio (using 10% of the key cache's capacity) and regression r/w has the bad ~10% hit ratio. (Sorry for the confusion; I accidentally swapped the console log files when reading the hit ratios.)

           trades-fwd-lcs-nolz4 Operation 1 (mixed writes+reads) gives a nice perf improvement during the first seconds but then evens out with the existing implementation. Operations 2, 4 and 6 (all just reads) show a slight regression or no difference.
           OTOH the cassci regression test r/w shows a perf regression of about 5%.

          ben.manes Ben Manes added a comment -

          cstar returns "Internal Server Error" so I can't look at the moment.

          I use YCSB's distributions for synthetic workloads. I'd expect adding those would be trivial.

           If the cache doesn't reach 50% capacity then Caffeine doesn't even initialize the frequency sketch, to save memory (e.g. when someone sets the maximum to a ridiculous threshold as a worst-case bound). It could probably also avoid the LRU shuffling, which would reduce the maintenance penalty.

          ben.manes Ben Manes added a comment -

           The latest snapshot jar includes two optimizations.

           Insertions now avoid an unnecessary lambda. I suspect that will have a negligible benefit, but it's always good to be more GC hygienic.

           A cache below 50% capacity will skip read-policy work. That means it won't record accesses in the ring buffers, which reduces contention. It also reduces how often the maintenance work is scheduled, as the buffers don't need to be drained. A write will still trigger a maintenance cycle, but that should be shorter by doing less. This results in throughput close to a raw ConcurrentHashMap, with the penalty incurred only once the threshold is crossed. That should improve trades-fwd-lcs-nolz4 and anyone else's usage where the cache is merely a safety threshold and isn't likely to grow close to the maximum.

          jeromatron Jeremy Hanna added a comment -

          As part of testing, should we also fix cstar perf relating to cache to no longer be uniform access distribution? It seems like for this and future tests, that would give a more real-world result.

          jbellis Jonathan Ellis added a comment -

          (cstar is back.)

          ben.manes Ben Manes added a comment -

           2.1.0 was released, which includes the above-mentioned optimizations. So the cache should seem artificially better on an artificial workload =)

          I haven't had a chance to try to add more workload profiles to the Cassandra tool. It would be nice if we knew what real-world distributions were like, as Zipf-like is what researchers published. From the traces I've experimented with, I am fairly confident in a net positive result.

          P.S. An article on HighScalability describes the overall design.

          aweisberg Ariel Weisberg added a comment -

           stress supports the Weibull distribution, so you shouldn't need a code change to find an alternative to uniform. Not sure if Weibull can fill the same niche as Zipf.

          benedict Benedict added a comment -

           Zipf could easily be added. I added Weibull as the extreme value distribution without thinking too much about things at the time, and still don't really understand (nor have the time to understand) probability distributions. Both are used to model both incidence and sizes, but in differing circumstances. Zipf originates from language, but the Weibull distribution apparently better models iconographic languages!

           It does seem that the literature standardizes on Zipf for incidence for things like access patterns. I'm pretty sure Weibull would also be fine, though it is perhaps better suited to generating extreme values (such as payload size); even that could happily be handled by Zipf, so perhaps we should just switch out Weibull entirely.

          ben.manes Ben Manes added a comment -

          I think Weibull should be fine, too.

           I rebased the pull request. I don't have permissions to schedule a run on cstar (nor do I think they should be granted). It would be nice if Robert or another team member ran another round of performance tests. My expectation is that in a scenario resembling a realistic workload there will be less I/O due to fewer cache misses. If there is a slight degradation (or gain) in other aspects of the new cache, that should have a negligible impact, as reduced I/O will be the dominant effect. This would then provide a good argument for OHC to revisit its eviction policy.

          blerer Benjamin Lerer added a comment -

           Robert Stupp, did you have a chance to schedule a run on cstar?

          snazy Robert Stupp added a comment -

           No, haven't had a chance to do it yet. AFAIU it's not just about setting up a new run; it's about setting up a new, realistic workload.

          ben.manes Ben Manes added a comment -

           It sounds like we all thought Weibull was a good choice. Another option is to use YCSB, which handles correlated omission and is a popular benchmark for comparing data stores.

          ben.manes Ben Manes added a comment -

          Some unfortunate duplication of work on the DataStax side, where a custom LIRS page cache was implemented for evaluation.

          HBase (and Accumulo) are evaluating a migration from their custom SLRU on-heap caches. That might add some insight that could aid in Cassandra's decision.

          benedict Benedict added a comment -

          Since I may be partially to blame for that, having made a comment suggesting LIRS quite some time ago, I'd like to chip in my two cents:

           The employment of probabilistic data structures for a cache is the obvious next step in their evolution (something I had hoped to explore before Tiny-LFU showed up), and Tiny-LFU, being the first such scheme, should absolutely be strongly considered. It is simply natural that a probabilistic approach should be capable of yielding better returns for a probabilistic application (hit rate), and if anything I'm surprised the world has been so slow to exploit this idea. Were I making the comment again today, I would have suggested that Tiny-LFU be considered the most sensible choice.

          That goes doubly for the fact that any given implementation is a stop-gap until the transition to Thread-Per-Core completes and we can make all of our algorithms more memory (and execution) efficient. Since Tiny-LFU already exists, it does seem an obvious choice on that front.

          ben.manes Ben Manes added a comment -

           When Cassandra moves to TPC, it should take less than a day to write a TinyLFU cache if you borrow my 4-bit CountMin sketch. I might evolve that to include the doorkeeper and other tricks, but it hasn't been a priority yet. The simulator includes a variant using incremental reset (which would be my preference for TPC) and shows a negligible difference in hit rates. A Go developer read the paper and wrote an implementation for fun in a few days, and I'm happy to review anyone's version.

           Caffeine tackled the research and concurrency side, but there's no reason to adopt it once TPC lands. Best to take the ideas, leverage its simulator to fine-tune, and share insights.
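
           For a flavor of the 4-bit counter packing that makes this cheap, here is a deliberately simplified, single-threaded sketch; Caffeine's actual FrequencySketch additionally takes the minimum over four hashed counters per key, uses stronger hash mixing, and triggers the aging pass automatically after a sample threshold:

```java
// Simplified sketch of 4-bit counter packing for a TinyLFU-style frequency filter.
public final class TinyFreqSketch {
    private final long[] table;          // each long holds sixteen 4-bit counters
    private final int mask;

    TinyFreqSketch(int counters) {       // counters must be a power of two
        table = new long[counters / 16];
        mask = counters - 1;
    }

    void increment(int hash) {
        int index = hash & mask;
        int slot = index >>> 4;          // which long in the table
        int shift = (index & 15) << 2;   // which 4-bit lane within the long
        long lane = (table[slot] >>> shift) & 0xFL;
        if (lane < 15) {                 // saturate at 15 instead of overflowing
            table[slot] += (1L << shift);
        }
    }

    int frequency(int hash) {
        int index = hash & mask;
        return (int) ((table[index >>> 4] >>> ((index & 15) << 2)) & 0xFL);
    }

    void age() {                         // halve all counters so history decays
        for (int i = 0; i < table.length; i++) {
            table[i] = (table[i] >>> 1) & 0x7777_7777_7777_7777L;
        }
    }

    public static void main(String[] args) {
        TinyFreqSketch sketch = new TinyFreqSketch(1024);
        for (int i = 0; i < 20; i++) sketch.increment("hot".hashCode());
        System.out.println(sketch.frequency("hot".hashCode())); // prints 15 (saturated)
    }
}
```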

          benedict Benedict added a comment -

          Yes, I agree it would not be difficult, and the rest of my statement holds for the post-TPC world - I would still think Tiny-LFU likely the best choice. That aspect of my comment was only meant to serve as an extra point in favour of a near-zero-cost solution in the meantime, saving any development expenditure for the more long-term solution.

          benedict Benedict added a comment - - edited

          We should definitely address Branimir's comment before merging. As he says, this could theoretically completely eliminate the benefit of the cache in certain circumstances.

           I think both supporting larger hashes, and not comparing only against the main eviction candidate for admission (e.g. comparing against a value at a logarithmically random distance from the eviction candidate), are reasonably easy and should solve the problem.

          ben.manes Ben Manes added a comment -

           We added protection against a HashDoS attack, thanks to Branimir's and Benedict's help. I upgraded the pull request to use this version.

          I think we're good to merge.

          ben.manes Ben Manes added a comment -

           Branimir Lambov integrated Caffeine (v2.2.6) for the chunk cache (CASSANDRA-5863). He included an analysis demonstrating good performance and hit rates. Thanks!

           Note that it was in version 2.2.7 that, thanks to Branimir Lambov and Benedict, we added strong protection against HashDoS attacks. I am currently exploring an adaptive version of the policy that improves its hit rate for small, recency-skewed caches. This would also naturally resolve the attack without needing our protection scheme.

           The chunk cache delegates the maintenance work to a same-thread executor. While there might be a performance gain from using ForkJoinPool#commonPool (the default), this was also a very wise choice: Druid was recently struck by JDK-8078490, where a race in 8u40 - 8u60 causes the pool to not execute the task.

          With Branimir Lambov's work proving the benefits, can we move this forward for the remaining caches? Afterwards I'd like to work with Robert Stupp on adding the policy to OHC to improve Cassandra's hit rates there too.

          ben.manes Ben Manes added a comment -

          Any remaining blockers?

          snazy Robert Stupp added a comment -

          Haven't looked thoroughly through the code yet. One question: lib/caffeine-2.3.3.jar has been changed in the patch. Is it a new version of caffeine?

          ben.manes Ben Manes added a comment -

           I rebased and updated the jar in the PR. It's the same as in our previous discussion. The upgrade consists of maintenance improvements made since the 2.2.6 version in use.

          snazy Robert Stupp added a comment -

          Just asking because the jar's been changed but not the version (looks like a change to a released version).

          ben.manes Ben Manes added a comment -

           Thanks. Fixed by updating the build.xml and removing the old jar. That explodes the patch size because the jars are checked into the repo. The linked PR mirrors it.

          I'd also like to pair on evaluating TinyLFU for OHC, but would like to see this go in before we jump into that.

          snazy Robert Stupp added a comment -

          Sorry for the delay. Still on my plate.

          snazy Robert Stupp added a comment -

          Rebased your patch against current trunk and triggered CI.

          trunk branch testall dtest
          ben.manes Ben Manes added a comment -

           The test failure might be due to delegating maintenance work (e.g. writes triggering an eviction) to an executor. CLHM and Guava amortized this on the calling threads, whereas Caffeine tries to hide it on the ForkJoinPool to minimize user-facing latencies. By setting Caffeine.executor(Runnable::run) it will behave similarly to its predecessors; ideally this is set only in tests, for predictability. Alternatively, calling cache.cleanUp() prior to inspecting the cache is another easy option.
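
           A sketch of that test setup (the cache, keys, and sizes are made up for illustration):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class SameThreadMaintenanceSketch {
    public static void main(String[] args) {
        // Run maintenance on the calling thread so evictions are immediately
        // visible, as CLHM and Guava behaved; ideally used only in tests.
        Cache<Long, String> cache = Caffeine.newBuilder()
            .maximumSize(100)
            .executor(Runnable::run)     // no ForkJoinPool hand-off
            .build();

        for (long i = 0; i < 200; i++) {
            cache.put(i, "v" + i);
        }
        cache.cleanUp();                 // drain buffers before inspecting
        System.out.println("size after cleanUp: " + cache.estimatedSize());
    }
}
```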

           I don't want to delay this again, but would like to ensure Cassandra updates to the upcoming 2.3.4 release. There is a difficult-to-trigger race in JCTools that we fixed over the weekend.

          ben.manes Ben Manes added a comment -

          Released 2.3.4 with JCTools fix. Please revisit.

          snazy Robert Stupp added a comment -

          Alright, finally found some time to look at this.

          Some comments & questions:

          • I've added .executor(MoreExecutors.directExecutor()) - hope I got your suggestion right.
          • AuthCache: there's a bunch of cache.policy().refreshAfterWrite().ifPresent( code sequences. I guess this is just to use the existing API "right" so it always sets all the values, right? Maybe just add a line comment that all values are always set since these are mandatory.
          • ConcurrentLinkedHashCache is no longer used in the whole code base. You can remove it.
          • SerializingCache can you correct the typo in throw new IllegalArgumentException("Serialized size cannot be more than 2GB"); to .. must not be more...
           • There are a couple of cache.asMap() calls. Would it be an option to eagerly create the AsMapView, Values and EntrySet instances in LocalAsyncLoadingCache to get around the ternaries in asMap(), values(), entrySet() and keySet()?

          Beside these, I've got no comments. Pretty straight forward change. Good work so far!

          Triggered a new CI round with the changes mentioned above.
          Do you have some micro-benchmarks in place to actually test against the previous implementation(s)? If Benedict (IIRC he already worked on Caffeine) says it buys us something, I'd be fine omitting the benchmarks.

          When you wanna change something, can you fork from my branch? This makes merging and checking changes easier.

          Regarding OHC, I'd need to fully understand Caffeine internals, i.e. the eviction algorithm, first. Mind opening a ticket for OHC on gh?

          ben.manes Ben Manes added a comment -

          I've added .executor(MoreExecutors.directExecutor()) - hope I got your suggestion right.

           This is good for unit tests to remove asynchronous behavior. My preference is to not use it in production, especially where latencies matter, so callers aren't penalized with maintenance or removal-notification work. Instead, deferring that to the FJP should help minimize response times, which I think would be your preference too. I'm not familiar enough with Cassandra's testing to know whether it's trivial to flag the executor. Usually it's pretty trivial, especially when DI like Guice is used.

          There are a couple of cache.asMap() calls. Would it be an option to eagerly create the AsMapView, Values and EntrySet instances in LocalAsyncLoadingCache to get around the ternaries in asMap(), values(), entrySet() and keySet?

           LocalAsyncLoadingCache isn't used by Cassandra (it's a cache that returns CompletableFutures). Given that the ternaries are null checks to lazily create views, as is common in the Java collections, I don't think it's a measurable penalty to keep them.

          Do you have some micro-benchmarks in place to actually test against the previous implementation(s)?

           For concurrent throughput (JMH) see these benchmarks. They show refinements over CLHM, with the primary benefit in writes. Since the cache supports memoization, the Cassandra APIs might benefit from using a computation instead of racy get-load-put calls.
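
           A sketch of the two patterns, where load stands in for whatever read-through logic a call site performs:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class MemoizationSketch {
    // Stand-in for whatever read-through logic a call site performs.
    static String load(String key) {
        return "value-for-" + key;
    }

    public static void main(String[] args) {
        Cache<String, String> cache = Caffeine.newBuilder().maximumSize(1_000).build();

        // Racy get-load-put: two threads can both miss and both run the loader.
        String value = cache.getIfPresent("k1");
        if (value == null) {
            value = load("k1");          // may run concurrently for the same key
            cache.put("k1", value);
        }

        // Memoized computation: the loader runs at most once per key, and
        // concurrent callers for that key wait for the in-flight computation.
        String memoized = cache.get("k1", MemoizationSketch::load);
        System.out.println(value + " / " + memoized);
    }
}
```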

          For hit rates see these simulations. They show W-TinyLFU improves upon LRU by taking into account frequency.

          I'll send a PR to your branch when I get a chance to go through the rest of the comments. Thanks!

          snazy Robert Stupp added a comment -

           How big is the penalty without the FJP? On an idle system the FJP should give a latency improvement if the cost of the task is greater than the cost of scheduling it. On a loaded system, however, handing off work to a separate thread might add a penalty due to scheduling and context switching. But that is probably a discussion that requires knowledge of how Caffeine works internally. I'm just cautious, since I don't like to add more threads than necessary.

          (If we go with FJP, the unit test needs to be fixed.)

          ben.manes Ben Manes added a comment -

           The penalty is small. In CLHM (and Guava) we didn't have a system executor to exploit. The same approach is used: amortize the maintenance work by buffering and replaying operations, instead of locking to perform them immediately. There is a slightly higher cost due to hashing for the CountMin sketch, but overall it's tiny. Using a direct executor should be fine.

          Please read the HighScalability article for an overview of internals.
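
           As a deliberately simplified illustration of that amortization idea, here is a single lossy ring buffer drained under a tryLock; the real implementation uses striped, lock-free ring buffers and more careful drain scheduling:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.concurrent.locks.ReentrantLock;

// Conceptual only: record reads into a lossy ring buffer and replay them
// against the eviction policy under a tryLock, instead of locking per read.
public final class ReadBufferSketch<K> {
    private final AtomicReferenceArray<K> buffer = new AtomicReferenceArray<>(128);
    private final AtomicInteger writeIndex = new AtomicInteger();
    private final ReentrantLock evictionLock = new ReentrantLock();

    void recordRead(K key) {
        int index = writeIndex.getAndIncrement() & 127;
        buffer.set(index, key);          // lossy: overwriting old entries is fine
        if (evictionLock.tryLock()) {    // drain opportunistically, never block
            try { drain(); } finally { evictionLock.unlock(); }
        }
    }

    private void drain() {
        for (int i = 0; i < 128; i++) {
            K key = buffer.getAndSet(i, null);
            if (key != null) {
                // replay against the policy, e.g. bump the key's recency/frequency
            }
        }
    }

    public static void main(String[] args) {
        ReadBufferSketch<String> sketch = new ReadBufferSketch<>();
        for (int i = 0; i < 1000; i++) sketch.recordRead("key-" + (i % 10));
        System.out.println("done");
    }
}
```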

          snazy Robert Stupp added a comment -

          Ping Ben Manes

          ben.manes Ben Manes added a comment -

          I think it's on your plate.
          https://github.com/snazy/cassandra/pull/1

          snazy Robert Stupp added a comment -

           Heh, classic. I overlooked the PR.

          snazy Robert Stupp added a comment -

          PR looks good - also rebased and removed the concurrentlinkedhashmap jar as it's no longer used. Triggered CI.

          githubbot ASF GitHub Bot added a comment -

          Github user ben-manes closed the pull request at:

          https://github.com/apache/cassandra/pull/59

          githubbot ASF GitHub Bot added a comment -

          Github user ben-manes commented on the issue:

          https://github.com/apache/cassandra/pull/59

          Closing as merged into @snazy's integration branch.

          snazy Robert Stupp added a comment -

          Unit tests look good. However, a couple of dtests fail - unfortunately a situation for which we do not have a unit test.
           The thing is that exception handling in Caffeine is different now. It manifests, for example, in the catch clause in org.apache.cassandra.auth.PasswordAuthenticator#authenticate. The loader calls queryHashedPassword, which throws org.apache.cassandra.auth.PasswordAuthenticator.NoSuchCredentialsException (extends RuntimeException). I assume that Caffeine just passes REs through and does not wrap them in some ExecutionException.
           An example assertion message from the dtests suggests that's the reason: '\'Failed to authenticate to 127.0.0.1: Error from server: code=0100 [Bad credentials] message="Provided username mike and/or password are incorrect"\' not found in \'Failed to authenticate to 127.0.0.1: Error from server: code=0000 [Server error] message="org.apache.cassandra.auth.PasswordAuthenticator$NoSuchCredentialsException"\'

           Ben Manes, can you verify that REs are just passed through and don't get wrapped by Caffeine's loader mechanism?

          trunk branch testall dtest
          ben.manes Ben Manes added a comment -

          Good observation Robert Stupp.

          Yes, RE is propagated as is and will never be wrapped.

          A checked exception will be wrapped with a CompletionException, the unchecked version of ExecutionException. (Guava uses its own UncheckedExecutionException)
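
           A small, self-contained example of that propagation behavior (the key and exception are made up):

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public final class LoaderExceptionSketch {
    public static void main(String[] args) {
        Cache<String, String> cache = Caffeine.newBuilder().build();
        try {
            // A RuntimeException thrown by the loader propagates unchanged;
            // only checked exceptions get wrapped in a CompletionException.
            cache.get("mike", k -> { throw new IllegalStateException("no such user: " + k); });
        } catch (IllegalStateException e) {
            System.out.println("caught as-is: " + e.getMessage());
        }
    }
}
```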

          snazy Robert Stupp added a comment -

          Yes, that's it. dtests & utests look good now.
          Can you take a look at my last commit to cross-check what I did? I removed that NoSuchCredentialsException altogether as it's no longer necessary (we refactored CassandraException to extend RuntimeException just recently - that NoSuchCredentialsException is older).

          ben.manes Ben Manes added a comment -

          LGTM

          snazy Robert Stupp added a comment -

          Alright, utests & dtests look good now.

          Committed as c607d76413be81a0e125c5780e068d7ab7594612 to trunk

          Thanks!

           ifesdjeen Alex Petrov added a comment - - edited

           This commit broke an authentication test: http://cassci.datastax.com/view/Dev/view/snazy/job/snazy-10855-caffeine-trunk-dtest/lastSuccessfulBuild/testReport/auth_test/TestAuth/system_auth_ks_is_alterable_test/ (still failing on trunk)
          ifesdjeen Alex Petrov added a comment -

          Opened a follow-up ticket: CASSANDRA-13367


            People

            • Assignee:
              ben.manes Ben Manes
              Reporter:
              ben.manes Ben Manes
              Reviewer:
              Robert Stupp
            • Votes:
              0
              Watchers:
              21
