Bug Category:Degradation - Slow Use Case
Discovered By:User Report
Source Control Link:
The method NativeAllocator.Region.allocate(..) uses an AtomicInteger for the current offset in the region. Allocations depends on a .compareAndSet(..) call.
In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node.
It has been witnessed up to 33% of CPU time stuck in the NativeAllocator.Region.allocate(..) loop (due to the CAS failures) during a heavy spark analytics write load.
These nodes: 40 CPU cores and 256GB ram; have relevant settings
- memtable_allocation_type: offheap_objects
- memtable_offheap_space_in_mb: 5120
- concurrent_writes: 160
Numerous flamegraphs demonstrate the problem. See attached profile_pbdpc23zafsrh_20200702.svg.
One possible solution is to have separate Regions per thread.
Code wise this is relatively easy to do, for example replacing NativeAllocator:59
But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with concurrent_writes: 160 that's 160+ regions, each of 1MB.
Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time.
The research paper arXiv:1305.5800 describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm.
Applying the Constant Backoff CAS Algorithm involves adding one line of code to NativeAllocator.Region.allocate(..) to sleep for one (or some constant number) nanoseconds after a CAS failure occurs.
Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of NativeAllocator.Region.allocate(..) can be demonstrated.
In the attached NativeAllocatorRegionTest.java class, which can be run standalone, the Region class: copied from NativeAllocator.Region; has also the casFailures field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the NativeAllocatorRegionTest.testRegionCAS method.
This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 10000000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns where CAS failures are ~two orders of magnitude lower. From a park time 10μs and higher there is a significant drop also at low contention rates.
This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads and park times. The biggest improvement is from no algorithm to a park time of 1ns where performance is one order of magnitude faster. From a park time of 100μs and higher there is a even further significant drop, especially at low contention rates.
Implementing Region Per Thread: see the NativeAllocatorRegionTest.testRegionThreadLocal method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms).
Region per Thread is an unrealistic solution as it introduces many new issues and problems, from increased memory use to leaking memory and GC issues. It is better tackled as part of a TPC implementation.
The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches which may impact throughput in some busy throughput scenarios, so this should to be tested further.