Details
- Type: Bug
- Status: Resolved
- Priority: Urgent
- Resolution: Fixed
- Bug Category: Availability
- Severity: Critical
- Complexity: Challenging
- Discovered By: User Report
- Platform: All
- Impacts: None
Description
When a node is under pressure, hundreds of thousands of requests can show up in the native transport queue, and they appear to take far longer to time out than the configured timeout allows. We should be shedding load much more aggressively and using a bounded queue for incoming work. This is extremely evident when a resource-consuming workload is combined with a smaller one:
Running 5.0 HEAD on a single node as of today:
# populate only
easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=100000 --workload.select=partition --maxrlat 100 --populate 10m --rate 50k -n 1

# workload 1 - larger reads
easy-cass-stress run RandomPartitionAccess -p 100 -r 1 --workload.rows=100000 --workload.select=partition --rate 200 -d 1d

# second workload - small reads
easy-cass-stress run KeyValue -p 1m --rate 20k -r .5 -d 24h
It also appears that requests don't time out at the requested server-side timeout:
        Writes                              Reads                               Deletes                          Errors
Count   Latency (p99)  1min (req/s)  |  Count   Latency (p99)  1min (req/s)  |  Count  Latency (p99)  1min (req/s)  |  Count    1min (errors/s)
950286  70403.93       634.77        |  789524  70442.07       426.02        |  0      0              0             |  9580484  18980.45
952304  70567.62       640.1         |  791072  70634.34       428.36        |  0      0              0             |  9636658  18969.54
953146  70767.34       640.1         |  791400  70767.76       428.36        |  0      0              0             |  9695272  18969.54
956833  71171.28       623.14        |  794009  71175.6        412.79        |  0      0              0             |  9749377  19002.44
959627  71312.58       656.93        |  795703  71349.87       435.56        |  0      0              0             |  9804907  18943.11
After stopping the load test altogether, it took nearly a minute before the requests were no longer queued.
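For illustration only, here is a minimal Java sketch of the two behaviours the description argues for: a bounded queue that rejects new work once it fills up, and dequeue-time shedding of requests that have already sat in the queue longer than the configured timeout. All class and method names are hypothetical; this is not the actual Cassandra native transport code.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class BoundedRequestExecutor {

    private final AtomicLong shedOnEnqueue = new AtomicLong();   // rejected because the queue was full
    private final AtomicLong shedOnDequeue = new AtomicLong();   // dropped because the queue wait exceeded the timeout
    private final long requestTimeoutNanos;
    private final ThreadPoolExecutor executor;

    public BoundedRequestExecutor(int threads, int maxQueuedRequests, long requestTimeoutMillis) {
        this.requestTimeoutNanos = TimeUnit.MILLISECONDS.toNanos(requestTimeoutMillis);
        // Bounded queue: once it is full, new work is rejected immediately instead of
        // hundreds of thousands of requests piling up behind a struggling node.
        this.executor = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                                               new ArrayBlockingQueue<>(maxQueuedRequests),
                                               (task, exec) -> {
                                                   shedOnEnqueue.incrementAndGet();
                                                   throw new RejectedExecutionException("request queue full");
                                               });
    }

    /** Submits a request; the caller would turn a rejection into an overload error for the client. */
    public void submit(Runnable request) {
        long enqueuedAt = System.nanoTime();
        executor.execute(() -> {
            // Shed on dequeue: if the request already waited longer than the configured
            // timeout, the client has given up, so don't spend CPU executing it.
            if (System.nanoTime() - enqueuedAt > requestTimeoutNanos) {
                shedOnDequeue.incrementAndGet();
                return;
            }
            request.run();
        });
    }

    public long shedOnEnqueue() { return shedOnEnqueue.get(); }
    public long shedOnDequeue() { return shedOnDequeue.get(); }

    public void shutdown() { executor.shutdown(); }
}

Measuring the wait from enqueue time (rather than from when a worker picks the task up) is the same idea raised in CASSANDRA-19215; whether a rejection should surface as an overload error or be retried is left to the caller in this sketch.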
Attachments
Issue Links
- relates to
  - CASSANDRA-19702 Test failure: largecolumn_test.TestLargeColumn (Resolved)
- supercedes
  - CASSANDRA-18766 high speculative retries on v4.1.3 (Resolved)
  - CASSANDRA-19215 "Query start time" in native transport request threads should be the task enqueue time (Resolved)
- links to