Details
Description
Attempting to open a multi-file dataset and write a re-partitioned version of it fails, seemingly because the data is collected into memory first. This happens for both wide and long data.
Steps to reproduce the issue:
1. Create a large dataset (100k columns, 300k rows in total) by writing one file to disk and making 20 copies of it. Each file has a footprint of roughly 7.5GB.
library(arrow)
library(dplyr)
library(fs)

rows <- 300000
cols <- 100000
partitions <- 20

# One partition's worth of data: 15k rows x 100k columns of random int16 values
wide_df <- as.data.frame(
  matrix(
    sample(1:32767, rows * cols / partitions, replace = TRUE),
    ncol = cols
  )
)

schem <- sapply(colnames(wide_df), function(nm) {int16()})
schem <- do.call(schema, schem)

wide_tab <- Table$create(wide_df, schema = schem)
write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")

# Copy the file 20 times to build a multi-file dataset
fs::dir_create("~/Documents/arrow_playground/wide_ds")
for (i in seq_len(partitions)) {
  file.copy(
    "~/Documents/arrow_playground/wide.parquet",
    glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet")
  )
}

ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
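For reference, opening the dataset itself is cheap; a quick sanity check (a sketch, assuming the paths above) suggests open_dataset() only reads file metadata and nothing is pulled into R at this point:

# Sanity-check sketch: opening the dataset only inspects file metadata
length(ds_wide$files)   # expect 20 files to be discovered
ds_wide$schema          # 100k int16 fields; no data has been scanned yet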
All the following steps fail:
2. Creating and writing a partitioned version of ds_wide.
ds_wide %>%
  mutate(grouper = round(V1 / 1024)) %>%
  write_dataset("~/Documents/arrow_playground/partitioned",
                partitioning = "grouper",
                format = "parquet")
3. Writing a non-partitioned dataset:
ds_wide %>%
  write_dataset("~/Documents/arrow_playground/partitioned",
                format = "parquet")
4. Creating the partitioning variable first and then attempting to write:
ds2 <- ds_wide %>%
  mutate(grouper = round(V1 / 1024))

ds2 %>%
  write_dataset("~/Documents/arrow_playground/partitioned",
                partitioning = "grouper",
                format = "parquet")
5. Attempting to write to CSV:
ds_wide %>%
  write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
                format = "csv")
None of the failures seem to originate in R code, and they all result in similar behaviour: the R session consumes increasing amounts of RAM until it crashes.
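One way to check that the growth happens in Arrow's C++ allocator rather than in R's heap is to watch the default memory pool while a write is running (a diagnostic sketch using arrow::default_memory_pool(), which exposes bytes_allocated, max_memory and backend_name):

# Diagnostic sketch: the RAM growth is expected to show up in Arrow's C++ memory
# pool, which is not visible to R's gc()
pool <- default_memory_pool()
pool$backend_name      # e.g. "jemalloc" or "mimalloc"
pool$bytes_allocated   # near zero right after open_dataset()
pool$max_memory        # high-water mark; expected to climb during the failing writes above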
Issue Links
- depends upon
  - ARROW-14648 [C++][Dataset] Change scanner readahead limits to be based on bytes instead of number of batches (Open)
  - ARROW-14635 [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes (In Progress)
- is depended upon by
  - ARROW-15411 [C++][Datasets] Improve memory usage of datasets (Open)
- is related to
  - ARROW-15081 [R][C++] Arrow crashes (OOM) on R client with large remote parquet files (Open)