Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
The following should reproduce the crash:
library(arrow)
library(dplyr)
server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
path <- server$path("Oct-2021/observations")
obs <- arrow::open_dataset(path)
path$ls() # observe -- 1 parquet file
obs %>% count() # CRASH
obs %>% to_duckdb() # also crash
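(For reference, the layout of that single file can be inspected without reading it in full; the file name below is a placeholder, since I haven't given the actual object name in the bucket:)
f <- server$OpenInputFile("Oct-2021/observations/part-0.parquet") # placeholder file name
pq <- ParquetFileReader$create(f)
pq$GetSchema() # column names and types
pq$num_row_groups # how many row groups the ~100 GB file contains
pq$num_rows # total rows (~1 billion)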
I have attempted to split this large (~100 GB) parquet file into smaller files, which helps:
path <- server$path("partitioned")
obs <- arrow::open_dataset(path)
path$ls() # observe, multiple parquet files now
obs %>% count()
(These parquet files were also created by arrow, btw, from a single large csv file provided by the original data provider (eBird). Unfortunately, generating the partitioned versions is cumbersome: the data is very unevenly distributed, there are few columns that can be partitioned on without creating thousands of parquet files, and even then the bulk of the ~1 billion rows falls within the same group. All the same, I think this is a bug, as there is no indication why arrow cannot handle a single 100 GB parquet file.)
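I generated the partitioned copy roughly along these lines (a sketch only; the csv file name and the partitioning column below are placeholders, and the real column depends on the eBird schema), reusing the server object from above:
ebd <- arrow::open_dataset("ebird-observations.csv", format = "csv") # placeholder csv name
# write back out as parquet, split on a column so the result is many smaller files
arrow::write_dataset(ebd, server$path("partitioned"), format = "parquet", partitioning = "country") # placeholder column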
Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB RAM.
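For completeness, the exact arrow build and its enabled capabilities (S3, datasets, etc.) can be reported with:
arrow::arrow_info() # arrow R/C++ versions and build capabilities
sessionInfo() # R version and platform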
Issue Links
- relates to: ARROW-14736 [C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails (Open)