[ARROW-15239] [C++][Compute] Introduce Bloom filters to hash join - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 6.0.0
Fix Version/s: 8.0.0
Component/s: C++
Labels:
- pull-request-available
- query-engine

External issue URL:
https://github.com/apache/arrow/issues/30736

Description

Bloom filters are a common way to improve the performance of hash joins in which many probe-side rows have no match on the build side. Because a Bloom filter is cheaper to probe than the join's hash table, it can eliminate such rows early in the processing pipeline at lower cost. The trade-off is that it can return false positives for a small percentage of inputs, so rows that pass the filter must still be checked against the hash table.

This task introduces a register-blocked Bloom filter: a practical variant of the Bloom filter that is tuned for hash-join query processing and is both more space-efficient and cheaper than filtering with a hash table. The data structure should support parallel construction from a vector of exec batches accumulated in memory, as well as vectorized lookup and filtering for a single exec batch. It should place no limit on the size of the Bloom filter (the number of inserted hashes), which requires 64-bit hashes for larger inputs. It should be verified that build and probe costs are reasonably low and that the false positive rate is at most a few percent, which should be acceptable for query processing.
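To make the idea concrete, here is a minimal single-threaded sketch of a register-blocked Bloom filter, in which every key touches exactly one 64-bit word. It is not the implementation from the linked pull request. The names `BlockedBloomFilterSketch`, `Mix64`, and `FilterBatch` are hypothetical, and the choices of 4 bits set per key, at least 8 filter bits per expected key, and a splitmix64-style mixer for generating test hashes are all assumptions made for illustration. Parallel construction and SIMD probing are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; not Arrow's actual class or API.
class BlockedBloomFilterSketch {
 public:
  // Assumption: size the filter at >= 8 bits per expected key, rounded up
  // to a power-of-two number of 64-bit blocks (8 keys per block on average).
  explicit BlockedBloomFilterSketch(std::size_t expected_keys) {
    std::size_t num_blocks = 1;
    while (num_blocks * 8 < expected_keys) num_blocks <<= 1;
    assert((num_blocks & (num_blocks - 1)) == 0);  // power of two
    blocks_.assign(num_blocks, 0);
  }

  // Sets the key's bits inside the single block selected by the hash.
  void Insert(std::uint64_t hash) { blocks_[BlockIndex(hash)] |= Mask(hash); }

  // Returns false only if the key was definitely never inserted.
  bool MayContain(std::uint64_t hash) const {
    const std::uint64_t mask = Mask(hash);
    return (blocks_[BlockIndex(hash)] & mask) == mask;
  }

 private:
  // Low 24 bits of the hash -> four 6-bit positions within a 64-bit word.
  static std::uint64_t Mask(std::uint64_t hash) {
    std::uint64_t mask = 0;
    for (int i = 0; i < 4; ++i) {
      mask |= std::uint64_t{1} << ((hash >> (6 * i)) & 63);
    }
    return mask;
  }

  // Remaining 40 bits select the block, so a 64-bit hash can address far
  // more blocks than a 32-bit hash could.
  std::size_t BlockIndex(std::uint64_t hash) const {
    return static_cast<std::size_t>(hash >> 24) & (blocks_.size() - 1);
  }

  std::vector<std::uint64_t> blocks_;
};

// splitmix64-style mixer, used here only to turn integer keys into hashes.
inline std::uint64_t Mix64(std::uint64_t x) {
  x += 0x9e3779b97f4a7c15ULL;
  x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
  x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
  return x ^ (x >> 31);
}

// Batch probe: returns the indices of hashes that pass the filter, i.e. the
// rows that would still need to be checked against the hash table.
inline std::vector<std::uint32_t> FilterBatch(
    const BlockedBloomFilterSketch& filter,
    const std::vector<std::uint64_t>& hashes) {
  std::vector<std::uint32_t> selected;
  for (std::size_t i = 0; i < hashes.size(); ++i) {
    if (filter.MayContain(hashes[i])) {
      selected.push_back(static_cast<std::uint32_t>(i));
    }
  }
  return selected;
}
```

In this sketch, a build-side pass would call `Insert` once per row hash, and the probe side would call `FilterBatch` on each batch of hashes before touching the hash table. Inserted keys always pass the filter; keys that were never inserted pass only as false positives. Because each lookup reads one 64-bit word, a probe costs at most one cache miss, which is the main appeal of the register-blocked layout.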

Issue Links

blocks: ARROW-15498 [C++][Compute] Implement Bloom filter pushdown between hash joins (Resolved)

is a child of: ARROW-12633 [C++] Query engine umbrella issue (Open)

links to: GitHub Pull Request #12067

People

Assignee: Michal Nowakiewicz

Reporter: Michal Nowakiewicz

Votes: 0

Watchers: 4

Dates

Created: 03/Jan/22 20:37

Updated: 11/Jan/23 08:45

Resolved: 24/Mar/22 08:19

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

10.5h