Apache Arrow / ARROW-16700

[C++] [R] [Datasets] aggregates on partitioning columns

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.0.0
    • Component/s: C++, R

    Description

      When summarizing a whole dataset (without group_by) and aggregating over a partitioning column, arrow returns wrong results:

      library(arrow, warn.conflicts = FALSE)
      library(dplyr, warn.conflicts = FALSE)
      
      df <- expand.grid(
        some_nulls = c(0L, 1L, 2L),
        year = 2010:2023,
        month = 1:12,
        day = 1:30
      )
      
      path <- tempfile()
      dir.create(path)
      write_dataset(df, path, partitioning = c("year", "month"))
      
      ds <- open_dataset(path)
      
      # with arrow the mins/maxes are off for partitioning columns
      ds %>%
        summarise(n = n(), min_year = min(year), min_month = min(month), min_day = min(day), max_year = max(year), max_month = max(month), max_day = max(day)) %>% 
        collect()
      #> # A tibble: 1 × 7
      #>       n min_year min_month min_day max_year max_month max_day
      #>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
      #> 1 15120     2023         1       1     2023        12      30
      
      # compared to what we get with dplyr
      df %>%
        summarise(n = n(), min_year = min(year), min_month = min(month), min_day = min(day), max_year = max(year), max_month = max(month), max_day = max(day)) %>% 
        collect()
      #>       n min_year min_month min_day max_year max_month max_day
      #> 1 15120     2010         1       1     2023        12      30
      
      # even min alone is off:
      ds %>%
        summarise(min_year = min(year)) %>% 
        collect()
      #> # A tibble: 1 × 1
      #>   min_year
      #>      <int>
      #> 1     2016
        
      # but non-partitioning columns are fine:
      ds %>%
        summarise(min_day = min(day)) %>% 
        collect()
      #> # A tibble: 1 × 1
      #>   min_day
      #>     <int>
      #> 1       1
        
        
      # But with a group_by, this seems ok
      ds %>%
        group_by(some_nulls) %>%
        summarise(min_year = min(year)) %>% 
        collect()
      #> # A tibble: 3 × 2
      #>   some_nulls min_year
      #>        <int>    <int>
      #> 1          0     2010
      #> 2          1     2010
      #> 3          2     2010
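
As a sanity check, the same aggregate can be computed after pulling the partitioning column into memory with collect() and summarising in dplyr. The snippet below is a minimal sketch on a smaller, made-up grid (the sizes and the `expected`, `actual`, and `local_result` names are illustrative, not from the report). It assumes that collect() itself returns the partition values correctly, which this report does not verify.

```r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

# Small illustrative grid (hypothetical sizes, not the data from the report)
df <- expand.grid(year = 2010:2012, month = 1:2, day = 1:3)

path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))
ds <- open_dataset(path)

# Reference values computed in memory with dplyr
expected <- df %>%
  summarise(min_year = min(year), max_year = max(year))

# Aggregate pushed down to arrow (the code path this issue is about)
actual <- ds %>%
  summarise(min_year = min(year), max_year = max(year)) %>%
  collect()

# Collect the partitioning column first, then aggregate in dplyr
local_result <- ds %>%
  select(year) %>%
  collect() %>%
  summarise(min_year = min(year), max_year = max(year))

# TRUE means the pushed-down aggregate agrees with the in-memory reference
identical(actual$min_year, expected$min_year)
identical(local_result$min_year, expected$min_year)
```

If the first comparison is FALSE while the second is TRUE, the discrepancy is in the pushed-down aggregation rather than in how the partition values are read.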
      

            People

              Assignee: Jeroen van Straten (jvanstraten)
              Reporter: Jonathan Keane (jonkeane)
              Votes: 0
              Watchers: 6

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 20m