In a 6-node load-test cluster, we have been constantly running a production-like workload on 2.1.18 without issues. After upgrading one node to 3.0.18 (the remaining 5 stayed on 2.1.18 once we saw the regression described below), the 3.0.18 node shows increased CPU usage, increased GC, high pending tasks in the mutation stage, dropped mutation messages ...
Some specs; all 6 nodes are equally sized:
- Bare metal, 32 physical cores, 512G RAM
- Xmx31G, G1, MaxGCPauseMillis=2000
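For reference, the JVM settings above correspond to roughly the following jvm.options fragment (only the three values listed above are from our setup; the file layout itself is just the standard Cassandra jvm.options convention):

```
-Xmx31G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=2000
```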
- cassandra.yaml basically unchanged, i.e. the same settings with regard to number of threads, compaction throttling, etc.
The following dashboard shows the highlighted areas (CPU, suspension) with metrics for all 6 nodes; the single outlier is the node upgraded to Cassandra 3.0.18.
Additionally, we see a large increase in pending tasks in the mutation stage after the upgrade:
And dropped mutation messages, which are also confirmed in the Cassandra log:
Judging from 15-minute JFR sessions on both versions (3.0.18, and 2.1.18 on a different node), at a high level it looks like the code path beneath BatchMessage.execute is producing roughly 10x more on-heap allocations in 3.0.18 than in 2.1.18.
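For reproducibility, the recordings were taken with the stock JDK tooling; a sketch of how such a 15-minute recording can be captured (the pid and file names are placeholders, and the unlock step applies to Oracle JDK 8, where JFR is still a commercial feature):

```
# Oracle JDK 8 only: unlock commercial features so JFR can be started at runtime
jcmd <cassandra-pid> VM.unlock_commercial_features

# Start a 15-minute recording, written to disk when it finishes
jcmd <cassandra-pid> JFR.start name=mutation-regression duration=15m filename=/tmp/node1-3.0.18.jfr

# Verify the recording is running
jcmd <cassandra-pid> JFR.check
```

The resulting .jfr files were then compared in Java Mission Control, which is where the allocation breakdown below comes from.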
Left => 3.0.18
Right => 2.1.18
The zipped JFRs exceed the 60MB limit for attaching directly to the ticket. I can upload them if another destination is available.