[KYLIN-5640] Support to automatically adjust the Bloom Filter based on data distribution - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 5.0-alpha
Fix Version/s: 5.0-beta
Component/s: Query Engine
Labels: None

Description

Why are the changes needed?

Currently, a Bloom filter is built from a user-specified NDV (number of distinct values). In general scenarios, the actual number of distinct values is not known in advance.
If the Bloom filter can be sized automatically according to the data, the file size can be reduced and read efficiency can also be improved.

What changes were proposed in this pull request?

DynamicBlockBloomFilter holds multiple BlockSplitBloomFilter instances as candidates and inserts each value into all candidates at the same time. The largest Bloom filter serves as an approximate distinct-value counter, and candidates that can no longer accommodate the observed number of distinct values are removed during data insertion.
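
The candidate-pruning idea described above can be illustrated with a minimal Python sketch. This is not the Kylin implementation: `SimpleBloomFilter` is a plain Bloom filter standing in for BlockSplitBloomFilter, and the class names, the halving capacity scheme, the `fpp` parameter, and the `best()` helper are all assumptions made for illustration.

```python
import hashlib
import math


class SimpleBloomFilter:
    """Plain Bloom filter; a hypothetical stand-in for BlockSplitBloomFilter."""

    def __init__(self, expected_ndv, fpp=0.01):
        self.expected_ndv = expected_ndv
        # Standard sizing: m = -n * ln(p) / (ln 2)^2, k = (m / n) * ln 2
        self.num_bits = max(8, math.ceil(-expected_ndv * math.log(fpp) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_ndv * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, value):
        # Double hashing: derive k bit positions from two 64-bit hashes.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value))


class DynamicBloomFilterSketch:
    """Keeps several candidate filters and drops those that become too small."""

    def __init__(self, max_ndv, num_candidates=4, fpp=0.01):
        # Assumed scheme: capacities halve from max_ndv, e.g. 8000, 4000, 2000, 1000.
        capacities = sorted({max(1, max_ndv >> i) for i in range(num_candidates)})
        self.candidates = [SimpleBloomFilter(c, fpp) for c in capacities]
        self.distinct_estimate = 0

    def add(self, value):
        largest = self.candidates[-1]
        # The largest filter doubles as an approximate distinct counter:
        # a value it has not seen before is counted as new.
        if not largest.might_contain(value):
            self.distinct_estimate += 1
        for candidate in self.candidates:
            candidate.add(value)
        # Remove candidates whose capacity is exceeded; always keep the largest.
        self.candidates = [
            c for c in self.candidates[:-1] if c.expected_ndv >= self.distinct_estimate
        ] + [largest]

    def best(self):
        """Smallest surviving candidate, i.e. the one that would be written out."""
        return self.candidates[0]
```

With `DynamicBloomFilterSketch(max_ndv=8000)`, the 1000-capacity candidate is dropped as soon as the distinct estimate passes 1000, so `best()` then returns the next larger survivor. Because Bloom filter false positives can only hide new values, the estimate may undercount slightly but never overcounts.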

People

Assignee: Zhiting Guo

Reporter: Zhiting Guo

Votes: 0

Watchers: 2

Dates

Created: 18/Jul/23 01:34

Updated: 23/Aug/23 08:09

Resolved: 23/Aug/23 08:09