[ARROW-12728] [C++][Compute] Implement count_distinct/distinct hash aggregate kernels - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 4.0.0
Fix Version/s: 6.0.0
Component/s: C++
Labels:
- kernel
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/28470

Description

Implement a count distinct aggregate that internally reuses the hash table from hash group by.

This adds support for SQL queries such as:
select a, count(distinct b), count(distinct c) from t group by a

For instance, to compute count(distinct b), the first group id mapping assigns a group id based on the value of column a; a second group id mapping is then done inside the count(distinct b) aggregate using the key (groupid(a), b) (and similarly for count(distinct c)).
After all input rows are consumed, the final processing step scans the hash tables keyed on (groupid(a), b) and updates an array of counts indexed by groupid(a): each distinct key adds one to the count of its group.
The resulting array of counts is the output of the count distinct aggregate.
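
The two-level mapping described above can be sketched in standalone C++. This is only an illustration using standard containers, not the Arrow kernel itself: the function name CountDistinctByGroup, the PairHash helper, and the column types (string for a, int64 for b) are assumptions made for the example, and null handling is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hash for the composite key (groupid(a), b). Hypothetical helper for this sketch.
struct PairHash {
  std::size_t operator()(const std::pair<uint32_t, int64_t>& key) const {
    std::size_t h1 = std::hash<uint32_t>{}(key.first);
    std::size_t h2 = std::hash<int64_t>{}(key.second);
    return h1 ^ (h2 + 0x9e3779b9u + (h1 << 6) + (h1 >> 2));
  }
};

// Computes count(distinct b) grouped by a. Assumes a.size() == b.size().
// Returns one count per group id; group ids are assigned in order of first
// appearance of each value of a.
std::vector<int64_t> CountDistinctByGroup(const std::vector<std::string>& a,
                                          const std::vector<int64_t>& b) {
  // First mapping: value of a -> dense group id.
  std::unordered_map<std::string, uint32_t> group_ids;
  // Second mapping: the set of distinct (groupid(a), b) keys.
  std::unordered_set<std::pair<uint32_t, int64_t>, PairHash> distinct_keys;

  for (std::size_t i = 0; i < a.size(); ++i) {
    // emplace keeps the existing id if a[i] was seen before.
    auto it =
        group_ids.emplace(a[i], static_cast<uint32_t>(group_ids.size())).first;
    distinct_keys.emplace(it->second, b[i]);
  }

  // Final step: scan the second table and count its entries per group id.
  std::vector<int64_t> counts(group_ids.size(), 0);
  for (const auto& key : distinct_keys) {
    ++counts[key.first];
  }
  return counts;
}
```

For example, with a = {"x", "x", "y", "x", "y"} and b = {1, 1, 2, 3, 2}, group "x" gets id 0 and group "y" gets id 1, the distinct keys are (0, 1), (0, 3) and (1, 2), and the returned counts are {2, 1}.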

Issue Links

is a child of:
- ARROW-12633 [C++] Query engine umbrella issue (Open)
- ARROW-13339 [C++] Implement hash_aggregate kernels (umbrella issue) (Open)

is depended upon by:
- ARROW-13620 [R] Binding for n_distinct() (Resolved)

is related to:
- ARROW-14035 [C++][Compute] Implement non-hash count_distinct aggregate kernel (Resolved)

links to:
- GitHub Pull Request #10876

People

Assignee: David Li

Reporter: Michal Nowakiewicz

Votes: 0

Watchers: 6

Dates

Created: 10/May/21 18:16

Updated: 11/Jan/23 08:28

Resolved: 25/Aug/21 20:49

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

2h 50m