Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
The following should reproduce the crash:
library(arrow)
library(dplyr)
server <- arrow::s3_bucket("ebird", endpoint_override = "minio.cirrus.carlboettiger.info")
path <- server$path("Oct-2021/observations")
obs <- arrow::open_dataset(path)
path$ls() # observe -- 1 parquet file
obs %>% count() # CRASH
obs %>% to_duckdb() # also crash
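(For reference, the layout of that single file can be inspected without reading it in full; the file name below is a placeholder, since I haven't given the actual object name in the bucket:)
f <- server$OpenInputFile("Oct-2021/observations/part-0.parquet") # placeholder file name
pq <- ParquetFileReader$create(f)
pq$GetSchema() # column names and types
pq$num_row_groups # how many row groups the ~100 GB file contains
pq$num_rows # total rows (~1 billion)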
I have attempted to split this large (~100 GB) parquet file into smaller files, which helps:
path <- server$path("partitioned")
obs <- arrow::open_dataset(path)
path$ls() # observe, multiple parquet files now
obs %>% count()
(These parquet files were also created by arrow, btw, from a single large csv file provided by the original data provider (eBird). Unfortunately, generating the partitioned versions is cumbersome: the data is very unevenly distributed, there are few columns that can be partitioned on without creating thousands of parquet files, and even then the bulk of the ~1 billion rows falls within the same group. All the same, I think this is a bug, as there is no indication why arrow cannot handle a single 100 GB parquet file.)
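I generated the partitioned copy roughly along these lines (a sketch only; the csv file name and the partitioning column below are placeholders, and the real column depends on the eBird schema), reusing the server object from above:
ebd <- arrow::open_dataset("ebird-observations.csv", format = "csv") # placeholder csv name
# write back out as parquet, split on a column so the result is many smaller files
arrow::write_dataset(ebd, server$path("partitioned"), format = "parquet", partitioning = "country") # placeholder column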
Let me know if I can provide more info! I'm testing in R with the latest CRAN version of arrow on a machine with 200 GB RAM.
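For completeness, the exact arrow build and its enabled capabilities (S3, datasets, etc.) can be reported with:
arrow::arrow_info() # arrow R/C++ versions and build capabilities
sessionInfo() # R version and platform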
Issue Links
- relates to: ARROW-14736 [C++][R] Opening a multi-file dataset and writing a re-partitioned version of it fails (Open)