CASSANDRA-13900: Massive GC suspension increase after updating to 3.0.14 from 2.1.18


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Urgent
    • Resolution: Duplicate
    • Fix Version/s: None
    • Component/s: Legacy/Core
    • Labels: None
    • Severity: Critical

    Description

      In short: After upgrading to 3.0.14 (from 2.1.18), we aren't able to process the same incoming write load on the same infrastructure anymore.

      We have a loadtest environment running 24x7, testing our software with Cassandra as the backend. Both loadtest and production are hosted in AWS and have the same spec on the Cassandra side, namely:

      • 9x m4.xlarge
      • 8G heap
      • CMS (400MB newgen)
      • 2TB EBS gp2
      • Client requests are entirely CQL

      per node. We have had a solid, constant baseline in loadtest at ~60% cluster CPU average, with constant simulated load running against our cluster, on Cassandra 2.1 for more than 2 years now.
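      For reference, heap sizing like the above is typically expressed via cassandra-env.sh (2.1.x) respectively jvm.options (3.0.x) roughly as follows; this is only a sketch of the standard knobs, the exact flag set in our environment may differ slightly:

        # cassandra-env.sh (2.1.x) -- per-node heap sizing
        MAX_HEAP_SIZE="8G"
        HEAP_NEWSIZE="400M"

        # jvm.options (3.0.x) -- equivalent explicit JVM flags for CMS with a 400MB newgen
        -Xms8G
        -Xmx8G
        -Xmn400M
        -XX:+UseParNewGC
        -XX:+UseConcMarkSweepGC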

      Recently we started to upgrade to 3.0.14 in this 9-node loadtest environment, and basically 3.0.14 isn't able to cope with the load anymore. No particular special tweaks, memory settings/changes etc.; everything is the same as with 2.1.18. We also haven't upgraded sstables yet, so the increase shown in the screenshot is not related to any manually triggered maintenance operation after upgrading to 3.0.14.

      According to our monitoring, with 3.0.14 we see GC suspension time increase by a factor of > 2, which of course directly correlates with a CPU increase to > 80%. See the attached screenshot "cassandra2118_vs_3014.jpg".
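      For context on what "GC suspension" means in the charts: it is essentially the share of wall-clock time the JVM spends in stop-the-world collections. A minimal, in-process sketch of how such a number can be derived from the JVM's own counters (this is not how our monitoring tool measures it, just an illustration using the standard GarbageCollectorMXBean API):

        import java.lang.management.GarbageCollectorMXBean;
        import java.lang.management.ManagementFactory;

        public class GcSuspensionSampler {
            public static void main(String[] args) throws InterruptedException {
                long prevTotalMs = 0;
                while (true) {
                    long totalMs = 0;
                    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                        totalMs += gc.getCollectionTime(); // accumulated stop-the-world time in ms
                    }
                    // percentage of the last 60s wall-clock time spent suspended in GC
                    System.out.printf("GC suspension (last 60s): %.2f%%%n", (totalMs - prevTotalMs) / 600.0);
                    prevTotalMs = totalMs;
                    Thread.sleep(60_000);
                }
            }
        }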

      This all means that the incoming load 2.1.18 handled is something 3.0.14 can't handle. So we would need to either scale up (e.g. m4.xlarge => m4.2xlarge) or scale out to be able to handle the same load, which is not an option cost-wise.

      Unfortunately I do not have Java Flight Recorder runs for 2.1.18 at the mentioned load, but I can provide a JFR session for our current 3.0.14 setup. The attached 5-minute JFR memory allocation view (cassandra3014_jfr_5min.jpg) shows compaction as the top allocation contributor for the captured 5-minute time frame. The window might cover compaction as top contributor only by "accident" (although the mentioned simulated client load was attached), but according to the stack traces we see classes new in 3.0, e.g. BTreeRow.searchIterator(), popping up as top contributors, so the new classes / data structures are possibly causing much more object churn now.
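      For anyone wanting to reproduce the JFR capture on a node: with an Oracle JDK 8 (commercial features required for Flight Recorder), a 5-minute recording can be taken roughly like this; process id and output path are placeholders, not the exact commands used for the attached session:

        # jvm.options / cassandra-env.sh: enable Flight Recorder on the Cassandra JVM
        -XX:+UnlockCommercialFeatures
        -XX:+FlightRecorder

        # take a 5 minute recording of the running process
        jcmd <cassandra-pid> JFR.start name=cassandra duration=300s filename=/tmp/cassandra3014_5min.jfr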

      Attachments

        1. cassandra_3.11.0_min_memory_utilization.jpg
          287 kB
          Thomas Steinmaurer
        2. cassandra3014_jfr_5min.jpg
          856 kB
          Thomas Steinmaurer
        3. cassandra2118_vs_3014.jpg
          394 kB
          Thomas Steinmaurer


            People

              Assignee: Unassigned
              Reporter: Thomas Steinmaurer (tsteinmaurer)
              Votes: 0
              Watchers: 6
