Status: Resolved
Resolution: Fixed
I am using a Windows 10 machine, R 4.1.0, up-to-date R packages, and the latest RStudio IDE.
Original issue report is below; here's an even more minimal example:
library(arrow)
library(dplyr)

td <- tempfile()
dir.create(td)

# if there is no partitioning in the data, this won't segfault
# write_dataset(iris, td) - swap this in and it won't segfault
write_dataset(group_by(iris, Species), td)

open_dataset(td) %>%
  filter(Species == "tulip") %>%
  group_by(Sepal.Length) %>%
  summarise(n = n()) %>%
  collect()
I was trying the new features introduced in the latest arrow package (6.0.2), based on examples from the “New Directions for Apache Arrow” talk.
The RStudio IDE was crashing and the R session was aborted.
Looking more closely, I found that I had downloaded only two years of data (2018 and 2019), so after the first filter (year == 2015) no data remains to be processed further.
After some debugging (replacing the collect() call), it turns out that summarize() is the function causing the crash.
as_dataset <- open_dataset("c:/Rproj_learn/nyc-taxi/", partitioning = c("year", "month")) %>%
  filter(total_amount > 100 & year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 5000) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
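Until the crash is fixed, one possible workaround is to check which partitions actually exist on disk before filtering on year. The sketch below is only an illustration: `available_years()` is a hypothetical helper (not part of arrow), and it assumes the dataset root contains one four-digit directory per year, as in the year/month layout above. The demo builds an empty temporary directory tree rather than touching real data.

```r
# Hypothetical helper (not an arrow function): list the year partitions
# found directly under the dataset root, assuming one directory per year.
available_years <- function(path) {
  dirs <- list.dirs(path, full.names = FALSE, recursive = FALSE)
  sort(as.integer(dirs[grepl("^[0-9]{4}$", dirs)]))
}

# Self-contained demo: mimic a layout that holds only 2018 and 2019.
root <- tempfile("nyc-taxi-")
for (y in c("2018", "2019")) {
  dir.create(file.path(root, y, "01"), recursive = TRUE)
}

years <- available_years(root)
wanted_year <- 2015

if (!(wanted_year %in% years)) {
  # Skip the grouped summarize entirely instead of running it on no data.
  message("No partition for year ", wanted_year, "; skipping the query.")
}
```

This avoids the empty-input aggregation altogether, but it only covers filters on partition columns; a filter on a regular column that matches nothing would not be caught this way.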
I would expect to get an error message (without crashing the IDE), which can be handled in code.
Another acceptable result would be an empty data.frame, as happens when the Parquet file is read in as a data.frame. I simulated this situation by setting a very high total_amount threshold when filtering. Note: when using an Arrow Table, an error message is generated instead.
library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.1
#> Warning: package 'tidyr' was built under R version 4.1.1
#> Warning: package 'readr' was built under R version 4.1.1
library(arrow)
#> Warning: package 'arrow' was built under R version 4.1.1
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#>     timestamp

read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", as_data_frame = FALSE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
#> Error: Invalid: Must pass at least one array

read_parquet("c:/Rproj_learn/nyc-taxi/2018/01/data.parquet", as_data_frame = TRUE) %>%
  # filter(total_amount > 100) %>%
  filter(total_amount > 1e10) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  mutate(tip_pct = tip_amount / total_amount * 100) %>%
  group_by(passenger_count) %>%
  summarize(avg_tip_pct = mean(tip_pct), n = n()) %>%
  filter(n > 500) %>%
  arrange(desc(avg_tip_pct)) %>%
  collect()
#> # A tibble: 0 x 3
#> # ... with 3 variables: passenger_count <int>, avg_tip_pct <dbl>, n <int>
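To illustrate what "can be handled in code" means here: if the failure surfaces as a regular R error, as it does in the Arrow Table case above, it can be caught with `tryCatch()` and replaced by an empty result. This is a minimal sketch, not a fix. `collect_or_fallback()` and `failing_query()` are hypothetical names, and the failure is simulated with `stop()` so the snippet runs without arrow. A segfault, as in the dataset case, cannot be caught this way.

```r
# Hypothetical helper (not part of arrow or dplyr): run a query function and
# return a fallback value if it raises an R error.
collect_or_fallback <- function(query, fallback = NULL) {
  tryCatch(
    query(),
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      fallback
    }
  )
}

# Simulated failure standing in for the Arrow Table pipeline, which reported
# "Invalid: Must pass at least one array".
failing_query <- function() {
  stop("Invalid: Must pass at least one array")
}

# Empty result with the same columns the data.frame path returned.
empty_result <- data.frame(
  passenger_count = integer(),
  avg_tip_pct = double(),
  n = integer()
)

result <- collect_or_fallback(failing_query, fallback = empty_result)
```

In real use, the function passed in would wrap the whole pipeline, for example `function() tbl %>% ... %>% collect()`.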
Issue Links
- relates to ARROW-14630 [C++] DCHECK in GroupByNode when error encountered (Resolved)