[CASSANDRA-16499] single-threaded write workloads can spend ~70% time waiting on SEPExecutor - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Low
Resolution: Unresolved
Fix Version/s: None
Component/s: Local/Other
Labels:
- performance

Change Category:
Performance
Complexity:
Challenging
Platform:

All
Impacts:

None

Description

Hi all! While conducting benchmarking of Cassandra against other databases for a particular healthcare solution, I found some surprising anomalies in single-threaded write performance.

Analysis & tracing suggest there might be some inefficiencies in inter-thread execution in Cassandra;

Tracing showed an average delay of 1.52 ms between StorageProxy.performLocally() being called, and the LocalMutationRunnable actually executing.
Total operation time averaged 2.06 ms (measured at Message.Dispatcher processRequest()). This suggested ~72% of the total operation time being lost waiting for thread scheduling in SEPExecutor.
With I tested with multiple threads, performance with 10 threads was 27x higher. This supports a hypothesis that scheduling delays may be hindering single-threaded progress.
Transaction throughput for Cassandra with a single-threaded workload, measured far lower than that of PostgreSQL on the same hardware. Postgres achieved ~200k committed transactions/ minute including fsync; Cassandra achieves ~37k per minute. Given they are both essentially writing to a commit log, it may be informative to understand why the differences are arising.

Cassandra's architecture seems in theory like it might be aligned for our usecase, given the Commit Log and Log Structured Merge design. Some of our customers have configurations posing high single-threaded loads. Therefore I spent some time trying to understand why efficiency for such loads seemed less than expected.

My investigation so far:

benchmarked Cassandra 3.11.10
stack-dumped it under load & identified a pattern of threads waiting in AbstractWriteResponseHandler while nothing else is busy
checked out Cassandra 3.11.10 source, built it, debugged & stepped through key areas to try and understand behavior.
instrumented key areas with custom tracing code & timestamps to 0.01 millisecond.
- see patch attached.
benchmarked Cassandra 4 beta 4 & verified delays also present
shown & traced delays with my healthcare scenario benchmark
shown & traced delays with the Cassandra stress-test tool.

The configuration was:

single-node Cassandra running locally, on a recent Dell laptop with NVmE SSD.
for the healthcare scenario:
- Java client app running 1 or 10 threads;
- trialled LOCAL_ONE and ANY consistency levels;
- trialled unbatched, BatchType.UNLOGGED and BatchType.LOGGED.
for 'cassandra-stress':
- cassandra-stress.bat write n=10000 -rate threads=1

Without deeply understanding the code, I have considered a couple of possible areas/ ideas as to improvement. I worked on the 3.11.10 codebase. I'd be interested to understand whether or not these might be sound or not; note that neither achieves as much improvement as might theoretically be hoped for.

My investigations based on the key observation of large delays between StorageProxy.performLocally() being invoked and the LocalMutationRunnable actually executing, for single-threaded workloads.

What I looked at:

Without fully understanding SEPExecutor.takeWorkPermit() – it answers true to execute immediately, false if scheduled – for this workload scheduling seemed slow.
- takeWorkPermit() answers false if no work permits are available.
- I noticed takeWorkPermit() also answers false if no task permits are available, even if no task permit need be taken.
- by changing this condition I was able to gain +45% performance.
Without deeply understanding SEP Executor/ Worker or sleep algorithms, I noted that in a single-thread workload SEPWorkers would likely spin & be put to sleep for a period after completing each task.
- I did wonder if the park ~~times~~ or parking behavior (empirically observed at 10,000 - 20,000 nanos) could contribute to threads being more aggressively de-scheduled.
- an experiment in keeping 1 SEPWorker awake (not sleeping at all) gained +7.9% performance.
- Note: initial ticket misread code as requesting 500,000 nanosecond sleeps. This has now been corrected.
Without deeply understanding SEP Executor/ Worker, I feel there may be more questions around how SEP Workers are brought out of SPINNING/ sleep state and whether this logic functions promptly & correctly.
- At a very initial stage of investigation: SEPWorker.assign() unparks threads when transitioning out of STOPPED state, but code appears potentially not to unpark threads coming out of SPINNING state.
- This is a very cursory "looking at the code" & initial debugging stage, but I'm not certain it's accurate. Attempted experiments to unpark for the SPINNING -> Work transition so far have caused lockups of 100% machine CPU use or dropped messages, rather than helping anything.
- If & when I can find out more, I'll post it here.

I will post the tracing code & traces I captured, and welcome some feedback and thoughts on these performance questions from the Cassandra dev community. Thanks all!

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

AMI Linux test -- 09.txt
11/Mar/21 04:23
24 kB
Tom Whitmore
analysis of David Capwell's latency stats -- 01.xlsx
16/Mar/21 03:07
9 kB
Tom Whitmore
Cassandra Write trace 5; warmed up -- 02.txt
08/Mar/21 04:48
9 kB
Tom Whitmore
MaybeStartSpinning Unpark fix; Cassandra Stress results -- 02.txt
10/Mar/21 00:13
6 kB
Tom Whitmore
MaybeStartSpinning Unpark fix; Linux benchmarks -- 07.xlsx
11/Mar/21 04:23
38 kB
Tom Whitmore
MaybeStartSpinning Unpark fix on 4beta4; Cassandra Stress results -- 01.txt
10/Mar/21 03:24
5 kB
Tom Whitmore
proposed fix patches.zip
10/Mar/21 00:13
2 kB
Tom Whitmore
Screen Shot 2021-03-15 at 10.55.05 AM.png
15/Mar/21 19:12
117 kB
David Capwell
Screen Shot 2021-03-15 at 10.55.14 AM.png
15/Mar/21 19:12
132 kB
David Capwell
Screen Shot 2021-03-15 at 10.56.02 AM.png
15/Mar/21 19:12
133 kB
David Capwell
Screen Shot 2021-03-15 at 10.56.08 AM.png
15/Mar/21 19:12
133 kB
David Capwell
Screen Shot 2021-03-15 at 10.56.14 AM.png
15/Mar/21 19:12
119 kB
David Capwell
Screen Shot 2021-03-15 at 10.56.25 AM.png
15/Mar/21 19:12
133 kB
David Capwell
Screen Shot 2021-03-15 at 10.57.22 AM.png
15/Mar/21 19:12
171 kB
David Capwell
Screen Shot 2021-03-15 at 10.57.31 AM.png
15/Mar/21 19:12
138 kB
David Capwell
Screen Shot 2021-03-15 at 10.57.40 AM.png
15/Mar/21 19:12
125 kB
David Capwell
Screen Shot 2021-03-15 at 10.57.47 AM.png
15/Mar/21 19:12
136 kB
David Capwell
SEPWorker trace 2 delay examples -- 01.txt
10/Mar/21 00:13
20 kB
Tom Whitmore
SEPWorker trace 2 delays.txt
10/Mar/21 00:13
111 kB
Tom Whitmore
SEPWorker trace 3 delays; with proposed fix.txt
10/Mar/21 00:13
55 kB
Tom Whitmore
Single-thread Latencies report -- 01.xlsx
08/Mar/21 04:48
11 kB
Tom Whitmore
Stress Write 2 sgl-thread vs 10 threads -- 01.txt
08/Mar/21 22:16
32 kB
Tom Whitmore
Stress Write sgl-thread 1 -- 01.txt
08/Mar/21 04:47
11 kB
Tom Whitmore
Stress Write trace 1.txt
08/Mar/21 04:48
28 kB
Tom Whitmore
tracing & experimental change patches.zip
08/Mar/21 04:48
15 kB
Tom Whitmore

single-threaded write workloads can spend ~70% time waiting on SEPExecutor

Details

Description

Attachments

Attachments

Activity

People

Dates