[ARROW-14583] [R][C++] Crash when summarizing after filtering to no rows on partitioned data - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 6.0.0
Fix Version/s: 6.0.1, 7.0.0
Component/s: C++, R
Labels:
- pull-request-available
- query-engine
Environment:
I am using a Windows 10 machine, R 4.1.0, up-to-date R packages, and the latest RStudio IDE.

External issue URL:
https://github.com/apache/arrow/issues/30131

Description

Original issue report is below; here's an even more minimal example:

library(arrow)
library(dplyr)
td <- tempfile()
dir.create(td)
# If the data is not partitioned, this won't segfault:
# write_dataset(iris, td)  # swap this in for the next line and it won't segfault
write_dataset(group_by(iris, Species), td)
open_dataset(td) %>%
  filter(Species == "tulip") %>%
  group_by(Sepal.Length) %>%
  summarise(n = n()) %>%
  collect()

I was trying the new features introduced in the latest arrow package (6.0.2), based on examples from the “New Directions for Apache Arrow” talk.

The RStudio IDE was crashing and the R session was aborted.

Looking closely, I found that I had downloaded only two years of data (2018 and 2019), so after the first filter (year == 2015) no rows remain to be processed further.

After some debugging, in which I replaced the collect() call, it turns out that summarize() is the function causing the crash.

as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/",
                           partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 5000) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

I would expect to get an error message (without crashing the IDE), which can be handled in code.

An alternative result would be an empty data.frame, as happens when the Parquet file is read in as a data.frame. I simulated this situation by filtering on a very high total_amount value. Note: when an Arrow Table is used instead, an error message is generated.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.1
#> Warning: package 'tidyr' was built under R version 4.1.1
#> Warning: package 'readr' was built under R version 4.1.1
library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.1
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp

read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
             as_data_frame = FALSE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> Error: Invalid: Must pass at least one array


read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", 
             as_data_frame = TRUE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()

#> # A tibble: 0 x 3
#> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
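
For reference, here is a minimal sketch of what "handled in code" could look like for the Arrow Table case, where a regular R error is raised. The helper name `safe_collect` and its `fallback` argument are hypothetical and not part of arrow or dplyr. This approach can only intercept ordinary R errors such as "Invalid: Must pass at least one array"; it cannot intercept the session crash reported for the partitioned dataset case.

```r
# Hypothetical helper (not part of arrow or dplyr): evaluate a query and
# return `fallback` instead of propagating a regular R error.
# It relies on lazy evaluation: `expr` is only evaluated inside tryCatch().
safe_collect <- function(expr, fallback = NULL) {
  tryCatch(
    expr,
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      fallback
    }
  )
}

# Simulated failure using the error text from the report above. In real use,
# the first argument would be the whole pipeline ending in collect().
result <- safe_collect(
  stop("Invalid: Must pass at least one array"),
  fallback = data.frame()
)
nrow(result)
```

Returning an empty data.frame as the fallback mirrors the `as_data_frame = TRUE` result shown above, so downstream code can treat "no rows" and "query failed on empty input" the same way if that is acceptable.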

Issue Links

relates to: ARROW-14630 [C++] DCHECK in GroupByNode when error encountered (Resolved)

links to: GitHub Pull Request #11623

People

Assignee: David Li
Reporter: Zsolt Kegyes-Brassai
Votes: 0
Watchers: 5

Dates

Created: 04/Nov/21 07:05
Updated: 11/Jan/23 08:40
Resolved: 08/Nov/21 19:33

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 2h 10m