Details
Type: Improvement
Status: Resolved
Priority: Normal
Resolution: Fixed
Fix Version/s: 4.0.12, 4.1.4, 5.0-alpha2, 5.0, 5.1
Change Category: Performance
Complexity: Normal
Platform: All
Description
I have noticed that compactions involving a lot of sstables are very slow (for example, major compactions). I have attached a cassandra-stress profile that can generate such a dataset under ccm. In my local test I have 2567 sstables at 4MB each.
I added code to track the wall clock time of various parts of the code. One problematic part is the ManyToOne constructor: tracing through the code, a ManyToOne is created over all the sstable iterators for every partition. In my local test I get a measly 60KB/sec read speed, bottlenecked on a single CPU core (since this code is single-threaded), with 85% of the wall clock time spent in the ManyToOne constructor.
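For context, the timing came from ad-hoc instrumentation; a minimal sketch of the approach (the class and method names below are hypothetical, not the actual patch attached here) is:

import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical helper illustrating the wall clock tracking described above;
// the real numbers came from ad-hoc timing code added inside Cassandra.
public final class WallClockTimer
{
    private static final AtomicLong totalNanos = new AtomicLong();
    private static final AtomicLong calls = new AtomicLong();

    // Wrap the section under test, e.g. the per-partition ManyToOne construction.
    public static <T> T time(Supplier<T> section)
    {
        long start = System.nanoTime();
        try
        {
            return section.get();
        }
        finally
        {
            totalNanos.addAndGet(System.nanoTime() - start);
            calls.incrementAndGet();
        }
    }

    public static void report()
    {
        System.out.printf("calls=%d total=%.1fms%n", calls.get(), totalNanos.get() / 1_000_000.0);
    }
}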
As another data point showing it is the merge-iterator part of the code: the cfstats tool from https://github.com/instaclustr/cassandra-sstable-tools/, which reads all the sstables but does no merging, gets a 26MB/sec read speed.
Tracking back from the ManyToOne call, I see this in UnfilteredPartitionIterators::merge:
for (int i = 0; i < toMerge.size(); i++)
{
    if (toMerge.get(i) == null)
    {
        if (null == empty)
            empty = EmptyIterators.unfilteredRow(metadata, partitionKey, isReverseOrder);
        toMerge.set(i, empty);
    }
}
I am not sure what the purpose of creating these empty iterators is, but on a whim I removed them before passing the list to ManyToOne, after which all the wall clock time shifted to CompactionIterator::hasNext() and the read speed increased to 1.5MB/s.
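For illustration, the experiment amounts to stripping the null entries instead of padding them with empty iterators before the merge is built; a standalone sketch of that filtering step (hypothetical helper, not the actual change I made) is:

import java.util.ArrayList;
import java.util.List;

// Hypothetical, self-contained sketch of the experiment: drop null entries
// from the per-partition list so the merge (ManyToOne) is built over fewer
// iterators, rather than substituting shared empty iterators for the nulls.
public final class DropNullIterators
{
    public static <T> List<T> withoutNulls(List<T> toMerge)
    {
        List<T> nonNull = new ArrayList<>(toMerge.size());
        for (T iter : toMerge)
            if (iter != null)
                nonNull.add(iter);
        return nonNull;
    }
}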
So it seems there are further bottlenecks in this code path, but the first is this ManyToOne construction and having to build it for every partition read.