[CASSANDRA-18123] Reuse of metadata collector can break key count calculation - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Normal
Resolution: Unresolved
Fix Version/s: 3.0.x, 3.11.x, 4.0.x, 4.1.x, 5.x
Component/s: Local/Compaction
Labels: None
Bug Category: Degradation - Performance Bug/Regression
Severity: Normal
Complexity: Normal
Discovered By: User Report
Platform: All
Impacts: None
Since Version: 3.0.0

Description

When flushing a memtable, we currently pass a single constructed MetadataCollector to the SSTableMultiWriter that writes the sstables. The writer may decide to split the data into multiple sstables (e.g. for separate disks, or as driven by the compaction strategy). If it does so, the same MetadataCollector is reused for every one of them, and the cardinality estimation component recorded for each individual sstable contains the data for all of them.
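The effect of sharing one collector can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyCollector`, `sampleKeys` and `reportedKeyCounts` are hypothetical names, an exact `HashSet` stands in for the real cardinality estimator, and every output is assumed to read its count only after all keys have been written. It is not Cassandra's flush code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class SharedCollectorSketch {
    /** Hypothetical stand-in for a metadata collector; an exact set replaces the real cardinality estimator. */
    static final class ToyCollector {
        private final Set<String> keys = new HashSet<>();
        void addKey(String key) { keys.add(key); }
        long estimatedKeys() { return keys.size(); }
    }

    /** Generates n distinct illustrative partition keys. */
    static List<String> sampleKeys(int n) {
        List<String> keys = new ArrayList<>(n);
        for (int i = 0; i < n; i++)
            keys.add("key-" + i);
        return keys;
    }

    /**
     * Distributes keys round-robin over `outputs` toy sstables and returns the key count
     * that each output's collector reports once all keys have been written.
     */
    static long[] reportedKeyCounts(List<String> keys, int outputs, boolean shareCollector) {
        ToyCollector shared = new ToyCollector();
        ToyCollector[] collectors = new ToyCollector[outputs];
        for (int i = 0; i < outputs; i++)
            collectors[i] = shareCollector ? shared : new ToyCollector();

        for (int i = 0; i < keys.size(); i++)
            collectors[i % outputs].addKey(keys.get(i));

        long[] reported = new long[outputs];
        for (int i = 0; i < outputs; i++)
            reported[i] = collectors[i].estimatedKeys();
        return reported;
    }

    public static void main(String[] args) {
        List<String> keys = sampleKeys(1000);
        System.out.println("shared:   " + Arrays.toString(reportedKeyCounts(keys, 4, true)));
        System.out.println("separate: " + Arrays.toString(reportedKeyCounts(keys, 4, false)));
    }
}
```

In this model, every output that shares the collector reports the full key count of the flush, while outputs with their own collector report only the keys they actually received.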

As a result, when such sstables are compacted, the estimated number of keys in the resulting sstables, which is used to determine the size of the bloom filter for the compaction result, is far too high.
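To see why an inflated key estimate translates directly into filter size, here is a minimal sketch using the textbook optimal Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits. This formula is an assumption made for illustration and is not necessarily how Cassandra sizes its filters; `optimalBits` and the values in `main` are hypothetical.

```java
final class BloomSizingSketch {
    /** Textbook optimal Bloom filter size in bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalBits(long expectedKeys, double falsePositiveRate) {
        if (expectedKeys < 0 || falsePositiveRate <= 0.0 || falsePositiveRate >= 1.0)
            throw new IllegalArgumentException("need expectedKeys >= 0 and 0 < falsePositiveRate < 1");
        double ln2 = Math.log(2.0);
        return (long) Math.ceil(-expectedKeys * Math.log(falsePositiveRate) / (ln2 * ln2));
    }

    public static void main(String[] args) {
        long actualKeys = 1_000_000L;        // illustrative key count
        long overestimate = 4 * actualKeys;  // illustrative 4x overestimate
        double fpRate = 0.01;                // illustrative target false-positive rate

        System.out.println("bits for actual keys:   " + optimalBits(actualKeys, fpRate));
        System.out.println("bits for overestimate:  " + optimalBits(overestimate, fpRate));
    }
}
```

Under this formula the filter size is linear in the expected key count, so an estimate that is k times too high yields a filter roughly k times larger than needed.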

This makes L1 bloom filters much bigger than they should be. One example (which came up during testing of the upcoming CEP-26, after inserting 100GB of data with 10% reads):
(current)

 		Bloom filter false positives: 22627369
 		Bloom filter false ratio: 0.02257
 		Bloom filter space used: 1848247864
 		Bloom filter off heap memory used: 2338964088

(fixed)

 		Bloom filter false positives: 24426545
 		Bloom filter false ratio: 0.02429
 		Bloom filter space used: 1118910096
 		Bloom filter off heap memory used: 1532357432
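For a quick read of these figures, the relative savings can be computed from the numbers quoted above. This is a small sketch (the helper `percentReduction` is a hypothetical name); by this arithmetic the space used drops by about 39% and the off heap memory by about 34%, while the false positive ratio changes only from 0.02257 to 0.02429.

```java
final class BloomFilterSavingsSketch {
    /** Percentage by which `after` is smaller than `before`. */
    static double percentReduction(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // Values copied from the (current) and (fixed) outputs quoted in this issue.
        long spaceCurrent = 1_848_247_864L;
        long spaceFixed = 1_118_910_096L;
        long offHeapCurrent = 2_338_964_088L;
        long offHeapFixed = 1_532_357_432L;

        System.out.printf("space used reduction: %.2f%%%n", percentReduction(spaceCurrent, spaceFixed));
        System.out.printf("off heap reduction:   %.2f%%%n", percentReduction(offHeapCurrent, offHeapFixed));
    }
}
```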

Issue Links

Is contained by: CASSANDRA-18397 CEP-26: Unified Compaction Strategy (Resolved)

People

Assignee: Unassigned
Reporter: Branimir Lambov
Votes: 0
Watchers: 4

Dates

Created: 16/Dec/22 14:19
Updated: 20/Jul/23 14:54