Details
Description
After ARROW-15260 I observe large increases in memory use and compute time with basic summarize queries. My use case shows nearly 10x memory and 10x compute time increases in some cases.
Here is a less dramatic reproduction, modeled on my real use case, which shows a 2x time increase:
library(arrow)
library(dplyr)
library(glue)
library(purrr)

dir.create(dir <- "/tmp/iris", showWarnings = FALSE)
for (day in seq_len(100)) {
  dir.create(glue("{dir}/day={day}"), showWarnings = FALSE)
  for (i in seq_len(10)) {
    # build a wide data frame: 20 copies of iris with suffixed column names
    dfs <- map(seq_len(20), function(j) {
      names(iris) <- paste0(names(iris), j)
      iris
    })
    df <- bind_cols(!!!dfs)
    write_parquet(df, glue("{dir}/day={day}/{i}.parquet"))
  }
}

system.time(
  open_dataset("/tmp/iris") %>%
    group_by(day, Species1) %>%
    summarise(N = n(), .groups = "drop") %>%
    collect()
)
Before commit 838687178: 0.33 sec; after: 0.73 sec.
If I restore the schema binding that was removed in that commit, the performance returns.
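A possible mitigation while the regression stands, assuming the slowdown comes from all fields being materialized when the scan projection is unbound, is to restrict the scan to the columns the query actually uses with an explicit `select()` (a hedged sketch, not a confirmed fix):

```r
library(arrow)
library(dplyr)

# Explicitly project only the two columns the aggregation needs,
# so the dataset scan does not have to load all 100 fields.
open_dataset("/tmp/iris") %>%
  select(day, Species1) %>%
  group_by(day, Species1) %>%
  summarise(N = n(), .groups = "drop") %>%
  collect()
```

Whether this avoids the extra memory and time depends on the projection actually being pushed down to the scanner in the affected versions.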
Issue Links
- is fixed by ARROW-17556 [C++] Unbound scan projection expression leads to all fields being loaded (Resolved)