Uploaded image for project: 'Cassandra'
  1. Cassandra
  2. CASSANDRA-15922

High CAS failures in NativeAllocator.Region.allocate(..)





      The method NativeAllocator.Region.allocate(..) uses an AtomicInteger for the current offset in the region. Allocations depends on a .compareAndSet(..) call.

      In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node.


      It has been witnessed up to 33% of CPU time stuck in the NativeAllocator.Region.allocate(..) loop (due to the CAS failures) during a heavy spark analytics write load.

      These nodes: 40 CPU cores and 256GB ram; have relevant settings

      • memtable_allocation_type: offheap_objects
      • memtable_offheap_space_in_mb: 5120
      • concurrent_writes: 160

      Numerous flamegraphs demonstrate the problem. See attached profile_pbdpc23zafsrh_20200702.svg.

      Suggestion: ThreadLocal Regions

      One possible solution is to have separate Regions per thread.
      Code wise this is relatively easy to do, for example replacing NativeAllocator:59

      private final AtomicReference<Region> currentRegion = new AtomicReference<>();


      private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};

      But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with concurrent_writes: 160 that's 160+ regions, each of 1MB.

      Suggestion: Simple Contention Management Algorithm (Constant Backoff)

      Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time.

      The research paper arXiv:1305.5800 describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm.

      Applying the Constant Backoff CAS Algorithm involves adding one line of code to NativeAllocator.Region.allocate(..) to sleep for one (or some constant number) nanoseconds after a CAS failure occurs.
      That is...

          // we raced and lost alloc, try again

      Constant Backoff CAS Algorithm Experiments

      Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of NativeAllocator.Region.allocate(..) can be demonstrated.

      In the attached NativeAllocatorRegionTest.java class, which can be run standalone, the Region class: copied from NativeAllocator.Region; has also the casFailures field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the NativeAllocatorRegionTest.testRegionCAS method.

      This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 10000000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns where CAS failures are ~two orders of magnitude lower. From a park time 10μs and higher there is a significant drop also at low contention rates.

      This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads and park times. The biggest improvement is from no algorithm to a park time of 1ns where performance is one order of magnitude faster. From a park time of 100μs and higher there is a even further significant drop, especially at low contention rates.

      Repeating the test run show reliably similar results: Screen Shot 2020-07-05 at 13.37.01.png and Screen Shot 2020-07-05 at 13.35.55.png.

      Region Per Thread Experiments

      Implementing Region Per Thread: see the NativeAllocatorRegionTest.testRegionThreadLocal method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms).


      Region per Thread is an unrealistic solution as it introduces many new issues and problems, from increased memory use to leaking memory and GC issues. It is better tackled as part of a TPC implementation.

      The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches which may impact throughput in some busy throughput scenarios, so this should to be tested further.


        1. Screen Shot 2020-07-06 at 13.26.10.png
          154 kB
          Michael Semb Wever
        2. Screen Shot 2020-07-06 at 11.36.44.png
          161 kB
          Michael Semb Wever
        3. Screen Shot 2020-07-06 at 11.35.35.png
          173 kB
          Michael Semb Wever
        4. Screen Shot 2020-07-05 at 13.48.16.png
          170 kB
          Michael Semb Wever
        5. Screen Shot 2020-07-05 at 13.37.01.png
          186 kB
          Michael Semb Wever
        6. Screen Shot 2020-07-05 at 13.35.55.png
          173 kB
          Michael Semb Wever
        7. Screen Shot 2020-07-05 at 13.26.17.png
          177 kB
          Michael Semb Wever
        8. Screen Shot 2020-07-05 at 13.16.10.png
          186 kB
          Michael Semb Wever
        9. profile_pbdpc23zafsrh_20200702.svg
          395 kB
          Michael Semb Wever
        10. NativeAllocatorRegionTest.java
          5 kB
          Michael Semb Wever
        11. NativeAllocatorRegion2Test.java
          4 kB
          Michael Semb Wever

        Issue Links



              mck Michael Semb Wever
              mck Michael Semb Wever
              Michael Semb Wever
              Benedict Elliott Smith, Robert Stupp
              0 Vote for this issue
              6 Start watching this issue