Details
- Type: Improvement
- Status: Resolved
- Priority: Critical
- Resolution: Fixed
None
Description
ARROW-13344 enabled the dplyr verb summarise() to use the Arrow engine but kept this off by default, controlled by the arrow.summarise option.
Before this can be turned on by default, we should ensure that the following are all implemented:
- a sufficient set of hash aggregate kernels and R aggregate function mappings to them, covering the vast majority of all aggregate functions that dplyr users call in summarise() (add any additional required ones to ARROW-13339)
- support for a sufficient set of data types in aggregates
- support for a sufficient set of data types in grouping columns
- handling of NA and NaN values in aggregates and the na.rm option consistent with base R and dplyr (ARROW-13497 and possibly other issues)
- handling of NA and NaN values in grouping columns consistent with dplyr
- handling empty or bad input to summarise() (ARROW-13543)
- many new tests to confirm equivalent results from a variety of group_by() %>% summarise() queries on data frames and on Arrow data
- resolution of various related bugs
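As an illustration of the kind of equivalence test described in the list above, here is a minimal sketch in R. It assumes the dplyr and arrow packages are installed and that the installed arrow version supports summarise() on Arrow data. The helper name compare_grouped_summary, the column names g and x, and the sample data are hypothetical and not taken from the Arrow test suite. The data deliberately avoids NaN values and groups with no non-missing values, since those are the cases where the two engines may differ.

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper: runs the same grouped summary on a data frame and on
# an Arrow Table built from it, then reports whether the results agree.
compare_grouped_summary <- function(df) {
  query <- function(.data) {
    .data %>%
      group_by(g) %>%
      summarise(
        total = sum(x, na.rm = TRUE),
        avg = mean(x, na.rm = TRUE)
      ) %>%
      collect() %>%  # pulls Arrow results into R; leaves a data frame as-is
      arrange(g)     # sort in R, since Arrow does not guarantee group order
  }
  expected <- query(df)
  actual <- query(Table$create(df))
  # all.equal() returns TRUE or a character description of the differences
  all.equal(as.data.frame(actual), as.data.frame(expected))
}

# Sample data: one missing value in the aggregated column and one missing
# value in the grouping column.
example_df <- data.frame(
  g = c("a", "a", "b", "b", NA),
  x = c(1, NA, 3, 5, 7),
  stringsAsFactors = FALSE
)

result <- compare_grouped_summary(example_df)
print(result)
```

Sorting after collect() keeps the comparison independent of the order in which the Arrow engine emits groups. A fuller test would repeat the comparison over more aggregate functions, data types, and na.rm settings.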
Issue Links
is blocked by
- ARROW-13497 [C++][R] FunctionOptions not used by aggregation nodes (Resolved)
- ARROW-13499 [R] Aggregation on expression doesn't NSE correctly (Resolved)
- ARROW-13543 [R] Handle summarize() with 0 arguments or no aggregate functions (Resolved)
- ARROW-13502 [R] Bindings for min/max aggregation (Resolved)
- ARROW-13550 [R] Support .groups argument to dplyr::summarize() (Resolved)
- ARROW-13501 [R] Bindings for count aggregation (Resolved)
- ARROW-13528 [R] Bindings for mean, var, sd aggregation (Resolved)
- ARROW-13772 [R] Binding for median() and quantile() aggregation functions (Resolved)
- ARROW-13777 [R] mutate after group_by should be ok as long as there are only scalar functions (Resolved)
- ARROW-13778 [R] Handle complex summarize expressions (Resolved)
- ARROW-13691 [C++] Add option to handle NAs to VarianceOptions (Resolved)
- ARROW-13740 [R] summarize() should not eagerly evaluate (Resolved)
- ARROW-13764 [C++] Implement ScalarAggregateOptions for count_distinct (grouped) (Resolved)
relates to
- ARROW-13344 [R] Initial bindings for ExecPlan/ExecNode (Resolved)