[CASSANDRA-15922] High CAS failures in NativeAllocator.Region.allocate(..) - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 3.0.21, 3.11.7, 4.0-beta1, 4.0
Component/s: Local/Memtable
Labels:
None

Bug Category:
Degradation - Slow Use Case
Severity:
Normal
Complexity:
Normal
Discovered By:
User Report
Platform:

All
Impacts:

None
Since Version:

1.0.0
Source Control Link:

https://github.com/apache/cassandra/commit/5a6d52b26191ab1f5df6e9cf941e0f964dd95a28
Test and Documentation Plan:

Hide

existing CI.
benchmarking in ticket.

Show
existing CI. benchmarking in ticket.

Description

Problem

The method NativeAllocator.Region.allocate(..) uses an AtomicInteger for the current offset in the region. Allocations depends on a .compareAndSet(..) call.

In highly contended environments the CAS failures can be high, starving writes in a running Cassandra node.

Example

It has been witnessed up to 33% of CPU time stuck in the NativeAllocator.Region.allocate(..) loop (due to the CAS failures) during a heavy spark analytics write load.

These nodes: 40 CPU cores and 256GB ram; have relevant settings

memtable_allocation_type: offheap_objects
memtable_offheap_space_in_mb: 5120
concurrent_writes: 160

Numerous flamegraphs demonstrate the problem. See attached profile_pbdpc23zafsrh_20200702.svg.

Suggestion: ThreadLocal Regions

One possible solution is to have separate Regions per thread.
Code wise this is relatively easy to do, for example replacing NativeAllocator:59

private final AtomicReference<Region> currentRegion = new AtomicReference<>();

with

private final ThreadLocal<AtomicReference<Region>> currentRegion = new ThreadLocal<>() {...};

But this approach substantially changes the allocation behaviour, with more than concurrent_writes number of Regions in use at any one time. For example with concurrent_writes: 160 that's 160+ regions, each of 1MB.

Suggestion: Simple Contention Management Algorithm (Constant Backoff)

Another possible solution is to introduce a contention management algorithm that a) reduces CAS failures in high contention environments, b) doesn't impact normal environments, and c) keeps the allocation strategy of using one region at a time.

The research paper arXiv:1305.5800 describes this contention CAS problem and demonstrates a number of algorithms to apply. The simplest of these algorithms is the Constant Backoff CAS Algorithm.

Applying the Constant Backoff CAS Algorithm involves adding one line of code to NativeAllocator.Region.allocate(..) to sleep for one (or some constant number) nanoseconds after a CAS failure occurs.
That is...

    // we raced and lost alloc, try again
    LockSupport.parkNanos(1);

Constant Backoff CAS Algorithm Experiments

Using the code attached in NativeAllocatorRegionTest.java the concurrency and CAS failures of NativeAllocator.Region.allocate(..) can be demonstrated.

In the attached NativeAllocatorRegionTest.java class, which can be run standalone, the Region class: copied from NativeAllocator.Region; has also the casFailures field added. The following two screenshots are from data collected from this class on a 6 CPU (12 core) MBP, running the NativeAllocatorRegionTest.testRegionCAS method.

This attached screenshot shows the number of CAS failures during the life of a Region (over ~215 million allocations), using different threads and park times. This illustrates the improvement (reduction) of CAS failures from zero park time, through orders of magnitude, up to 10000000ns (10ms). The biggest improvement is from no algorithm to a park time of 1ns where CAS failures are ~two orders of magnitude lower. From a park time 10μs and higher there is a significant drop also at low contention rates.

This attached screenshot shows the time it takes to fill a Region (~215 million allocations), using different threads and park times. The biggest improvement is from no algorithm to a park time of 1ns where performance is one order of magnitude faster. From a park time of 100μs and higher there is a even further significant drop, especially at low contention rates.

Repeating the test run show reliably similar results: Screen Shot 2020-07-05 at 13.37.01.png and Screen Shot 2020-07-05 at 13.35.55.png.

Region Per Thread Experiments

Implementing Region Per Thread: see the NativeAllocatorRegionTest.testRegionThreadLocal method; we can expect zero CAS failures of the life of a Region. For performance we see two orders of magnitude lower times to fill up the Region (~420ms).

Costs

Region per Thread is an unrealistic solution as it introduces many new issues and problems, from increased memory use to leaking memory and GC issues. It is better tackled as part of a TPC implementation.

The backoff approach is simple and elegant, and seems to improve throughput in all situations. It does introduce context switches which may impact throughput in some busy throughput scenarios, so this should to be tested further.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

profile_pbdpc23zafsrh_20200702.svg
05/Jul/20 10:38
395 kB
Michael Semb Wever
Screen Shot 2020-07-05 at 13.16.10.png
05/Jul/20 11:16
186 kB
Michael Semb Wever
Screen Shot 2020-07-05 at 13.26.17.png
05/Jul/20 11:26
177 kB
Michael Semb Wever
Screen Shot 2020-07-05 at 13.35.55.png
05/Jul/20 11:36
173 kB
Michael Semb Wever
Screen Shot 2020-07-05 at 13.37.01.png
05/Jul/20 11:37
186 kB
Michael Semb Wever
Screen Shot 2020-07-05 at 13.48.16.png
05/Jul/20 11:48
170 kB
Michael Semb Wever
NativeAllocatorRegionTest.java
06/Jul/20 09:40
5 kB
Michael Semb Wever
Screen Shot 2020-07-06 at 11.36.44.png
06/Jul/20 09:40
161 kB
Michael Semb Wever
Screen Shot 2020-07-06 at 11.35.35.png
06/Jul/20 09:40
173 kB
Michael Semb Wever
Screen Shot 2020-07-06 at 13.26.10.png
06/Jul/20 11:26
154 kB
Michael Semb Wever
NativeAllocatorRegion2Test.java
06/Jul/20 11:34
4 kB
Michael Semb Wever

Issue Links

is related to

CASSANDRA-16072 Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS loops to atomic adds

Resolved

relates to

CASSANDRA-2252 arena allocation for memtables

Resolved

Activity

People

Assignee:: Michael Semb Wever

Reporter:: Michael Semb Wever

Authors:: Michael Semb Wever

Reviewers:: Benedict Elliott Smith, Robert Stupp

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 05/Jul/20 11:50

Updated:: 03/Jan/21 17:02

Resolved:: 07/Jul/20 10:54