[ARROW-12725] [C++][Compute] GroupBy: improve performance by encoding keys in row format only when they are inserted into hash table - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 6.0.0
Component/s: C++
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/28467

Description

The previous implementation of hash group-by converts input ExecBatches to a row-oriented format,
then hashes and compares rows as if they were a single column.
It is more efficient (especially for a small number of key columns) to avoid the relatively costly
encoding. Instead, hashes are computed for individual columns in column-oriented format and mixed together, and column-oriented input data is compared directly against the row-oriented data in the hash table, without converting the input.
Encoding then happens only for the subset of input rows that are inserted into the hash table, that is, the rows that introduce new groups.
Keys in the hash table remain stored in row-oriented format.
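
The idea can be illustrated with a minimal, self-contained sketch. It is not the code from the pull request and does not use Arrow's API: `RowKeyGrouper`, `Column`, `HashRow`, `Equal`, and `Consume` are hypothetical names, keys are assumed to be fixed-width `int64_t` columns, the hash-mixing formula is an arbitrary choice, and `std::unordered_multimap` stands in for whatever hash table the real implementation uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for one fixed-width key column of a batch.
using Column = std::vector<int64_t>;

struct RowKeyGrouper {
  std::size_t num_cols;
  // Row-oriented key storage: group g occupies rows[g*num_cols .. g*num_cols+num_cols).
  std::vector<int64_t> rows;
  // Maps a mixed hash to candidate group ids (collisions resolved by Equal).
  std::unordered_multimap<uint64_t, uint32_t> table;
  std::size_t rows_encoded = 0;  // how many input rows were converted to row format

  explicit RowKeyGrouper(std::size_t n) : num_cols(n) { assert(n > 0); }

  // Hash row r straight from the columns: hash each column value, then mix.
  uint64_t HashRow(const std::vector<Column>& cols, std::size_t r) const {
    uint64_t h = 0;
    for (const Column& col : cols) {
      uint64_t x = static_cast<uint64_t>(col[r]) * 0x9E3779B97F4A7C15ULL;
      h ^= x + 0x9E3779B97F4A7C15ULL + (h << 6) + (h >> 2);
    }
    return h;
  }

  // Compare columnar input row r with the stored row-oriented key of group g,
  // without converting the input row.
  bool Equal(const std::vector<Column>& cols, std::size_t r, uint32_t g) const {
    for (std::size_t c = 0; c < num_cols; ++c) {
      if (cols[c][r] != rows[g * num_cols + c]) return false;
    }
    return true;
  }

  // Returns a group id per input row; encodes a row only when it starts a new group.
  std::vector<uint32_t> Consume(const std::vector<Column>& cols) {
    assert(cols.size() == num_cols);
    const std::size_t n = cols[0].size();
    std::vector<uint32_t> ids(n);
    for (std::size_t r = 0; r < n; ++r) {
      const uint64_t h = HashRow(cols, r);
      auto range = table.equal_range(h);
      bool found = false;
      for (auto it = range.first; it != range.second; ++it) {
        if (Equal(cols, r, it->second)) {
          ids[r] = it->second;
          found = true;
          break;
        }
      }
      if (!found) {
        const uint32_t g = static_cast<uint32_t>(rows.size() / num_cols);
        for (const Column& col : cols) rows.push_back(col[r]);  // encode on insert only
        ++rows_encoded;
        table.emplace(h, g);
        ids[r] = g;
      }
    }
    return ids;
  }
};
```

Calling `Consume` on a two-column batch with five rows, of which three are distinct, assigns five group ids but appends only three rows to `rows`. The sketch processes one row at a time for clarity; a real implementation would presumably work on whole batches at once, which this example does not attempt to show.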

Issue Links

is a child of

ARROW-12633 [C++] Query engine umbrella issue

Open

links to

GitHub Pull Request #10290

People

Assignee: Michal Nowakiewicz

Reporter: Michal Nowakiewicz

Votes: 0

Watchers: 3

Dates

Created: 10/May/21 18:03

Updated: 11/Jan/23 08:28

Resolved: 30/Aug/21 17:42
