Apache Arrow / ARROW-15679

[R] count should return an ungrouped dataframe


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.0.0
    • Fix Version/s: 8.0.0
    • Component/s: R

    Description

      Unless its input is already grouped, `dplyr::count` returns an ungrouped data.frame. The arrow implementation instead leaves grouping variables on the result:

       

      library(arrow, warn.conflicts = FALSE)
      library(dplyr, warn.conflicts = FALSE)
      tf1 <- tempfile()
      dir.create(tf1)
      starwars |>
        write_dataset(tf1)
      
      # no group ----------------------------------------------------------------
      ## dplyr behaviour
      count_dplyr_no_group <- starwars %>%
        count(gender, homeworld, species)
      group_vars(count_dplyr_no_group)
      #> character(0)
      ## arrow behaviour
      count_arrow_no_group <- open_dataset(tf1) %>%
        count(gender, homeworld, species) %>%
        collect()
      group_vars(count_arrow_no_group)
      #> [1] "gender"    "homeworld"
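For comparison, this is the grouped case in plain dplyr: when the input is already grouped, `count()` keeps only that original grouping on its result, so the variables passed to `count()` are not added as groups. A minimal sketch using dplyr alone (no arrow involved); the choice of `gender` as the pre-existing group is only for illustration:

```r
library(dplyr, warn.conflicts = FALSE)

# Input is grouped by `gender` before counting
grouped_count <- starwars %>%
  group_by(gender) %>%
  count(homeworld, species)

# Only the original grouping survives; homeworld and species are not groups
group_vars(grouped_count)
#> [1] "gender"
```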
      

      If I am correct that this is undesired behaviour, I think it can be fixed using this patch:

       

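      count.arrow_dplyr_query <- function(x, ..., wt = NULL, sort = FALSE, name = NULL) {
        # Add the counting variables as groups on top of any existing ones
        if (!missing(...)) {
          out <- dplyr::group_by(x, ..., .add = TRUE)
        } else {
          out <- x
        }
        out <- dplyr::tally(out, wt = {{ wt }}, sort = sort, name = name)
      
        # Grouping on the result should match the grouping of the input
        gv <- dplyr::group_vars(x)
        if (rlang::is_empty(gv)) {
          # Input was ungrouped, so drop the groups left over from tally()
          out <- dplyr::ungroup(out)
        } else {
          # Restore original group vars
          out$group_by_vars <- gv
        }
        out
      }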
      

       

      I can submit a PR with some tests if that would be helpful.
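A rough idea of what such tests could check, written as a sketch rather than as the tests from any actual PR. It assumes a fixed version of arrow is installed, uses `testthat`, and builds a small in-memory Table with `arrow_table()` instead of a dataset on disk; the column selection is arbitrary:

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(testthat)

# Small in-memory Arrow Table (illustrative columns only)
tbl <- arrow_table(starwars[, c("gender", "homeworld", "species")])

test_that("count() on ungrouped Arrow data returns ungrouped output", {
  out <- tbl %>%
    count(gender, homeworld) %>%
    collect()
  expect_identical(group_vars(out), character(0))
})

test_that("count() on grouped Arrow data keeps only the original groups", {
  out <- tbl %>%
    group_by(gender) %>%
    count(homeworld) %>%
    collect()
  expect_identical(group_vars(out), "gender")
})
```

The first test is the one expected to fail on the affected version, since it encodes the behaviour shown in the reproduction above.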

            People

              Assignee: Sam Albers (boshek)
              Reporter: Sam Albers (boshek)

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 40m