[ARROW-14479] [C++][Compute] Hash Join microbenchmarks

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 7.0.0
Fix Version/s: 7.0.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/30038

Description

Implement a series of microbenchmarks that characterize the performance of the Arrow hash join implementation across a range of dimensions.
Compare its performance against one or more other products.
Add scripts that generate visual reports of hash join costs.

Examples of dimensions to explore in microbenchmarks:

- number of duplicate keys on the build side
- relative size of the build side to the probe side
- selectivity of the join
- number of key columns
- number of payload columns
- filtering performance for semi and anti joins
- dense integer key vs sparse integer key vs string key
- build size
- scaling of build, filtering, and probe
- inner vs left outer, inner vs right outer
- left semi vs right semi, left anti vs right anti, left outer vs right outer
- non-uniform key distribution
- monotonic key values in input, partitioned key values in input (with and without per-batch min-max metadata)
- chain of multiple hash joins
- overhead of a non-selective Bloom filter
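
To make a few of these dimensions concrete, here is a minimal sketch of a parameterized inner-join microbenchmark. It is not the Arrow implementation and not the benchmark added for this issue: `std::unordered_map` stands in for the hash table, and the names `JoinBenchParams`, `JoinBenchResult`, and `RunInnerJoinBench` are hypothetical. It varies only three of the dimensions above (duplicate keys on the build side, probe size relative to build size, and selectivity) and times the build and probe phases separately.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Hypothetical parameter set covering a few of the dimensions listed above.
struct JoinBenchParams {
  int64_t build_distinct_keys;  // distinct keys on the build side
  int64_t build_duplicates;     // build rows per distinct key
  int64_t probe_rows;           // rows on the probe side
  double selectivity;           // fraction of probe rows that find a match
  uint32_t seed;                // RNG seed, for reproducible inputs
};

struct JoinBenchResult {
  int64_t output_rows;   // rows an inner join would emit
  double build_seconds;  // time spent building the hash table
  double probe_seconds;  // time spent probing it
};

// Stand-in for a real hash join: counts output rows instead of materializing them.
inline JoinBenchResult RunInnerJoinBench(const JoinBenchParams& p) {
  assert(p.build_distinct_keys > 0 && p.build_duplicates > 0);
  assert(p.probe_rows >= 0 && p.selectivity >= 0.0 && p.selectivity <= 1.0);
  std::mt19937_64 rng(p.seed);

  // Build side: keys 0..build_distinct_keys-1, each repeated build_duplicates times.
  std::vector<int64_t> build_keys;
  build_keys.reserve(static_cast<std::size_t>(p.build_distinct_keys * p.build_duplicates));
  for (int64_t k = 0; k < p.build_distinct_keys; ++k) {
    for (int64_t d = 0; d < p.build_duplicates; ++d) build_keys.push_back(k);
  }

  // Probe side: the first `matching` rows draw keys that exist on the build side;
  // the rest use keys >= build_distinct_keys, which can never match.
  const auto matching =
      static_cast<int64_t>(p.selectivity * static_cast<double>(p.probe_rows));
  std::uniform_int_distribution<int64_t> pick(0, p.build_distinct_keys - 1);
  std::vector<int64_t> probe_keys;
  probe_keys.reserve(static_cast<std::size_t>(p.probe_rows));
  for (int64_t i = 0; i < p.probe_rows; ++i) {
    probe_keys.push_back(i < matching ? pick(rng) : p.build_distinct_keys + i);
  }
  std::shuffle(probe_keys.begin(), probe_keys.end(), rng);

  using Clock = std::chrono::steady_clock;
  const auto t0 = Clock::now();
  std::unordered_map<int64_t, int64_t> rows_per_key;  // key -> number of build rows
  for (int64_t key : build_keys) ++rows_per_key[key];
  const auto t1 = Clock::now();

  int64_t output_rows = 0;
  for (int64_t key : probe_keys) {
    auto it = rows_per_key.find(key);
    if (it != rows_per_key.end()) output_rows += it->second;
  }
  const auto t2 = Clock::now();

  return JoinBenchResult{output_rows,
                         std::chrono::duration<double>(t1 - t0).count(),
                         std::chrono::duration<double>(t2 - t1).count()};
}
```

With this input generator the output row count is deterministic: the truncated value of `selectivity * probe_rows`, multiplied by `build_duplicates`. That makes it a cheap sanity check before comparing timings. A real benchmark would sweep one parameter at a time, repeat each configuration, and report throughput rather than a single wall-clock sample.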

Issue Links

is a child of: ARROW-14182 [C++][Compute] Hash Join performance improvement (Resolved)
links to: GitHub Pull Request #11876

People

Assignee: Sasha Krassovsky
Reporter: Michal Nowakiewicz
Votes: 0
Watchers: 3

Dates

Created: 26/Oct/21 19:01
Updated: 11/Jan/23 08:40
Resolved: 13/Jan/22 02:36

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 9.5h