[ARROW-13627] [C++] ScalarAggregateOptions don't make sense (in hash aggregation) - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0.0
Component/s: C++
Labels:

External issue URL:
https://github.com/apache/arrow/issues/29266

Description

R's aggregation functions have a na.rm argument that governs how missing data is handled. For example, given x <- c(1, 2, NA, 3), sum(x, na.rm = TRUE) is 6, while sum(x, na.rm = FALSE) is NA because at least one value is missing.

ScalarAggregateOptions has two options: skip_nulls and min_count. From what I can tell reading the source, in the context of sum(), skip_nulls affects whether each element of the Array is added to "count"; if count < min_count, a null value is returned. So to get the expected behavior when calling "sum" on an Array, when na.rm = TRUE, we pass skip_nulls = false, min_count = 0. When na.rm = FALSE, we pass skip_nulls = true, min_count = length, the reasoning being that a null value is returned unless all values are non-null (and count == length). See https://github.com/apache/arrow/blob/master/r/R/compute.R#L125-L130
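The reading of skip_nulls and min_count described above can be sketched as a toy model in plain Python. This is only an illustration of that reading, not Arrow's kernel: model_sum is a hypothetical helper, and None stands in for a null/NA value.

```python
def model_sum(values, skip_nulls, min_count):
    """Toy model of the reading above (hypothetical, not Arrow's implementation).

    Nulls never contribute to the total. With skip_nulls=True they are also
    left out of count; with skip_nulls=False they are still counted.
    The result is null (None) when count < min_count.
    """
    total = 0
    count = 0
    for v in values:
        if v is None:
            if not skip_nulls:
                count += 1  # null is counted but adds nothing to the total
            continue
        total += v
        count += 1
    return total if count >= min_count else None


x = [1, 2, None, 3]

# na.rm = TRUE is mapped to skip_nulls = false, min_count = 0
print(model_sum(x, skip_nulls=False, min_count=0))      # 6

# na.rm = FALSE is mapped to skip_nulls = true, min_count = length
print(model_sum(x, skip_nulls=True, min_count=len(x)))  # None
```

Note that the na.rm = FALSE case only works because len(x) is known up front, which is the assumption the rest of this issue questions.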

This doesn't really work in the query engine, though. We don't know how many rows are in the data, so we can't set a min_count that gives the expected behavior: the dataset being queried may be filtered. And when doing hash aggregation, each group may have a different number of rows.
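To make the hash aggregation problem concrete, here is a minimal sketch using the same toy model of sum. The names model_sum and grouped_sum, and the example groups, are hypothetical and chosen only for illustration. It searches for a single shared min_count that reproduces na.rm = FALSE for two groups of different sizes.

```python
def model_sum(values, skip_nulls, min_count):
    """Toy model of sum (hypothetical): null (None) result when count < min_count."""
    total = 0
    count = 0
    for v in values:
        if v is None:
            if not skip_nulls:
                count += 1  # null is counted but adds nothing to the total
            continue
        total += v
        count += 1
    return total if count >= min_count else None


def grouped_sum(groups, skip_nulls, min_count):
    """Apply one shared set of options to every group, as a single options object would."""
    return {key: model_sum(vals, skip_nulls, min_count) for key, vals in groups.items()}


# Two groups with different row counts; only "b" contains a null.
groups = {"a": [1, 2], "b": [1, None, 3, 4]}

# What na.rm = FALSE should give per group.
expected = {"a": 3, "b": None}

# Try every shared min_count from 0 to 5 with skip_nulls = true.
matching = [m for m in range(0, 6) if grouped_sum(groups, True, m) == expected]
print(matching)  # []
```

In this model, any min_count small enough to let group "a" through (at most 2) also lets group "b" through despite its null, so no shared value gives the expected per-group result.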

Issue Links

blocks

ARROW-13497 [C++][R] FunctionOptions not used by aggregation nodes

Resolved

links to

GitHub Pull Request #10942

People

Assignee: David Li

Reporter: Neal Richardson

Votes: 0

Watchers: 6

Dates

Created: 13/Aug/21 19:02

Updated: 11/Jan/23 08:34

Resolved: 23/Aug/21 17:27
