Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- 7.0.0
Description
Unless the input was grouped beforehand, `dplyr::count` returns an ungrouped data.frame. The arrow implementation instead keeps grouping variables on the result:
    library(arrow, warn.conflicts = FALSE)
    library(dplyr, warn.conflicts = FALSE)

    tf1 <- tempfile()
    dir.create(tf1)
    starwars |> write_dataset(tf1)

    # no group ----------------------------------------------------------------
    ## dplyr behaviour
    count_dplyr_no_group <- starwars %>%
      count(gender, homeworld, species)
    group_vars(count_dplyr_no_group)
    #> character(0)

    ## arrow behaviour
    count_arrow_no_group <- open_dataset(tf1) %>%
      count(gender, homeworld, species) %>%
      collect()
    group_vars(count_arrow_no_group)
    #> [1] "gender" "homeworld"
If I am correct that this is undesired behaviour, I think it can be fixed here using this patch:
    count.arrow_dplyr_query <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
      if (!missing(...)) {
        out <- dplyr::group_by(x, ..., .add = TRUE)
      } else {
        out <- x
      }
      out <- dplyr::tally(out, wt = {{ wt }}, sort = sort, name = name)

      gv <- dplyr::group_vars(x)
      if (rlang::is_empty(gv)) {
        out <- dplyr::ungroup(out)
      } else {
        # Restore original group vars
        out$group_by_vars <- gv
      }

      out
    }
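For reference, the dplyr behaviour that the two branches of the patch are meant to mirror can be checked on plain data frames. This is only a minimal sketch: `mtcars` and the variable names `ungrouped_result` and `grouped_result` are stand-ins chosen for illustration and do not come from the arrow code base.

```r
library(dplyr, warn.conflicts = FALSE)

# Ungrouped input: count() returns an ungrouped result.
ungrouped_result <- mtcars %>%
  count(cyl, gear)
group_vars(ungrouped_result)
#> character(0)

# Grouped input: count() keeps only the groups that existed beforehand;
# the columns passed to count() are not left as grouping variables.
grouped_result <- mtcars %>%
  group_by(cyl) %>%
  count(gear)
group_vars(grouped_result)
#> [1] "cyl"
```

The same two cases, run against `open_dataset()` output followed by `collect()`, would be a natural starting point for tests of the patched method.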
I can submit a PR with some tests if that would be helpful.