Details
Description
Attempting to open a multi-file dataset and write a re-partitioned version of it fails, seemingly because the data is collected into memory first. This happens for both wide and long data.
Steps to reproduce the issue:
1. Create a large dataset (100k columns, 300k rows in total) by writing one file to disk and making 20 copies of it. Each file has a footprint of roughly 7.5GB.
library(arrow)
library(dplyr)
library(fs)

rows <- 300000
cols <- 100000
partitions <- 20

# One partition's worth of data: 15k rows x 100k columns of random int16 values
wide_df <- as.data.frame(
  matrix(
    sample(1:32767, rows * cols / partitions, replace = TRUE),
    ncol = cols
  )
)

schem <- sapply(colnames(wide_df), function(nm) {int16()})
schem <- do.call(schema, schem)

wide_tab <- Table$create(wide_df, schema = schem)
write_parquet(wide_tab, "~/Documents/arrow_playground/wide.parquet")

# Copy the file 20 times to build a multi-file dataset
fs::dir_create("~/Documents/arrow_playground/wide_ds")
for (i in seq_len(partitions)) {
  file.copy(
    "~/Documents/arrow_playground/wide.parquet",
    glue::glue("~/Documents/arrow_playground/wide_ds/wide-{i-1}.parquet")
  )
}

ds_wide <- open_dataset("~/Documents/arrow_playground/wide_ds/")
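For reference, opening the dataset itself is cheap; a quick sanity check (a sketch, assuming the paths above) suggests open_dataset() only reads file metadata and nothing is pulled into R at this point:

# Sanity-check sketch: opening the dataset only inspects file metadata
length(ds_wide$files)   # expect 20 files to be discovered
ds_wide$schema          # 100k int16 fields; no data has been scanned yet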
All the following steps fail:
2. Creating and writing a partitioned version of ds_wide.
ds_wide %>%
  mutate(grouper = round(V1 / 1024)) %>%
  write_dataset("~/Documents/arrow_playground/partitioned",
                partitioning = "grouper",
                format = "parquet")
3. Writing a non-partitioned dataset:
ds_wide %>%
  write_dataset("~/Documents/arrow_playground/partitioned",
                format = "parquet")
4. Creating the partitioning variable first and then attempting to write:
ds2 <- ds_wide %>%
  mutate(grouper = round(V1 / 1024))

ds2 %>%
  write_dataset("~/Documents/arrow_playground/partitioned",
                partitioning = "grouper",
                format = "parquet")
5. Attempting to write to CSV:
ds_wide %>%
  write_dataset("~/Documents/arrow_playground/csv_writing/test.csv",
                format = "csv")
None of the failures seem to originate in R code, and they all result in similar behaviour: the R session consumes increasing amounts of RAM until it crashes.
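One way to check that the growth happens in Arrow's C++ allocator rather than in R's heap is to watch the default memory pool while a write is running (a diagnostic sketch using arrow::default_memory_pool(), which exposes bytes_allocated, max_memory and backend_name):

# Diagnostic sketch: the RAM growth is expected to show up in Arrow's C++ memory
# pool, which is not visible to R's gc()
pool <- default_memory_pool()
pool$backend_name      # e.g. "jemalloc" or "mimalloc"
pool$bytes_allocated   # near zero right after open_dataset()
pool$max_memory        # high-water mark; expected to climb during the failing writes above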
Issue Links
- depends upon
  - ARROW-14648 [C++][Dataset] Change scanner readahead limits to be based on bytes instead of number of batches (Open)
  - ARROW-14635 [C++][Dataset] Devise a mechanism to limit the total "system ram" (process + cache) used by dataset writes (In Progress)
- is depended upon by
  - ARROW-15411 [C++][Datasets] Improve memory usage of datasets (Open)
- is related to
  - ARROW-15081 [R][C++] Arrow crashes (OOM) on R client with large remote parquet files (Open)