[CASSANDRA-15430] Cassandra 3.0.18: BatchMessage.execute - 10x more on-heap allocations compared to 2.1.18 - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Resolved
Priority: Normal
Resolution: Duplicate
Fix Version/s: 3.0.23
Component/s: Local/Other
Labels:
None

Bug Category:
Degradation - Resource Management
Severity:
Normal
Complexity:
Normal
Discovered By:
User Report
Platform:

All
Impacts:

None
Since Version:

3.0 alpha 1

Description

In a 6 node loadtest cluster, we have been running with 2.1.18 a certain production-like workload constantly and sufficiently. After upgrading one node to 3.0.18 (remaining 5 still on 2.1.18 after we have seen that sort of regression described below), 3.0.18 is showing increased CPU usage, increase GC, high mutation stage pending tasks, dropped mutation messages ...

Some spec. All 6 nodes equally sized:

Bare metal, 32 physical cores, 512G RAM
Xmx31G, G1, max pause millis = 2000ms
cassandra.yaml basically unchanged, thus same settings in regard to number of threads, compaction throttling etc.

Following dashboard shows highlighted areas (CPU, suspension) with metrics for all 6 nodes and the one outlier being the node upgraded to Cassandra 3.0.18.

Additionally we see a large increase on pending tasks in the mutation stage after the upgrade:

And dropped mutation messages, also confirmed in the Cassandra log:

INFO  [ScheduledTasks:1] 2019-11-15 08:24:24,780 MessagingService.java:1022 - MUTATION messages were dropped in last 5000 ms: 41552 for internal timeout and 0 for cross node timeout
INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,157 StatusLogger.java:52 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - MutationStage                   256     81824     3360532756         0                 0

INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ViewMutationStage                 0         0              0         0                 0

INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,168 StatusLogger.java:56 - ReadStage                         0         0       62862266         0                 0

INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - RequestResponseStage              0         0     2176659856         0                 0

INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - ReadRepairStage                   0         0              0         0                 0

INFO  [ScheduledTasks:1] 2019-11-15 08:24:25,169 StatusLogger.java:56 - CounterMutationStage              0         0              0         0                 0
...

Judging from a 15min JFR session for both, 3.0.18 and 2.1.18 on a different node, high-level, it looks like the code path underneath BatchMessage.execute is producing ~ 10x more on-heap allocations in 3.0.18 compared to 2.1.18.

Left => 3.0.18
Right => 2.1.18

JFRs zipped are exceeding the 60MB limit to directly attach to the ticket. I can upload them, if there is another destination available.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

dashboard.png
15/Nov/19 12:55
286 kB
Thomas Steinmaurer
mutation_stage.png
15/Nov/19 12:57
27 kB
Thomas Steinmaurer
jfr_allocations.png
15/Nov/19 13:02
226 kB
Thomas Steinmaurer
screenshot-1.png
17/Jan/20 12:22
217 kB
Thomas Steinmaurer
screenshot-2.png
17/Jan/20 12:27
220 kB
Thomas Steinmaurer
screenshot-3.png
17/Jan/20 12:28
210 kB
Thomas Steinmaurer
screenshot-4.png
17/Jan/20 13:05
13 kB
Thomas Steinmaurer
jfr_jmc_2-1.png
12/Oct/20 20:29
2.33 MB
Michael Semb Wever
jfr_jmc_2-1_obj.png
12/Oct/20 20:29
1015 kB
Michael Semb Wever
jfr_jmc_3-0.png
12/Oct/20 20:29
2.44 MB
Michael Semb Wever
jfr_jmc_3-0_obj.png
12/Oct/20 20:29
1.11 MB
Michael Semb Wever
jfr_jmc_3-11.png
12/Oct/20 20:30
2.50 MB
Michael Semb Wever
jfr_jmc_3-11_obj.png
12/Oct/20 20:30
1.12 MB
Michael Semb Wever
jfr_jmc_4-0-b2.png
12/Oct/20 20:30
2.53 MB
Michael Semb Wever
jfr_jmc_4-0-b2_obj.png
12/Oct/20 20:30
1.06 MB
Michael Semb Wever
jfr_jmc_3-0_obj_obj_alloc.png
12/Oct/20 20:50
2.21 MB
Michael Semb Wever

Issue Links

duplicates

CASSANDRA-16201 Reduce amount of allocations during batch statement execution

Resolved

relates to

CASSANDRA-13929 BTree$Builder / io.netty.util.Recycler$Stack leaking memory

Resolved

CASSANDRA-16201 Reduce amount of allocations during batch statement execution

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Thomas Steinmaurer

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 15/Nov/19 13:04

Updated:: 01/Aug/21 12:15

Resolved:: 14/Oct/20 07:14