[ARROW-16700] [C++] [R] [Datasets] aggregates on partitioning columns - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.0.0
Component/s: C++, R
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/32043

Description

When summarizing a whole dataset (without group_by) and aggregating over a partitioning column, arrow returns incorrect results:

library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

df <- expand.grid(
  some_nulls = c(0L, 1L, 2L),
  year = 2010:2023,
  month = 1:12,
  day = 1:30
)

path <- tempfile()
dir.create(path)
write_dataset(df, path, partitioning = c("year", "month"))

ds <- open_dataset(path)

# with arrow the mins/maxes are off for partitioning columns
ds %>%
  summarise(
    n = n(),
    min_year = min(year), min_month = min(month), min_day = min(day),
    max_year = max(year), max_month = max(month), max_day = max(day)
  ) %>%
  collect()
#> # A tibble: 1 × 7
#>       n min_year min_month min_day max_year max_month max_day
#>   <int>    <int>     <int>   <int>    <int>     <int>   <int>
#> 1 15120     2023         1       1     2023        12      30

# compared to what we get with dplyr
df %>%
  summarise(
    n = n(),
    min_year = min(year), min_month = min(month), min_day = min(day),
    max_year = max(year), max_month = max(month), max_day = max(day)
  ) %>%
  collect()
#>       n min_year min_month min_day max_year max_month max_day
#> 1 15120     2010         1       1     2023        12      30

# even min alone is off:
ds %>%
  summarise(min_year = min(year)) %>% 
  collect()
#> # A tibble: 1 × 1
#>   min_year
#>      <int>
#> 1     2016
  
# but non-partitioning columns are fine:
ds %>%
  summarise(min_day = min(day)) %>% 
  collect()
#> # A tibble: 1 × 1
#>   min_day
#>     <int>
#> 1       1
  
  
# But with a group_by, this seems ok
ds %>%
  group_by(some_nulls) %>%
  summarise(min_year = min(year)) %>% 
  collect()
#> # A tibble: 3 × 2
#>   some_nulls min_year
#>        <int>    <int>
#> 1          0     2010
#> 2          1     2010
#> 3          2     2010

Issue Links

links to

GitHub Pull Request #13518

People

Assignee: Jeroen van Straten

Reporter: Jonathan Keane

Votes: 0

Watchers: 6

Dates

Created: 31/May/22 20:08

Updated: 11/Jan/23 11:45

Resolved: 22/Jul/22 16:24

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 3h 20m