Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version: 3.0.0
Description
When writing a large `tibble` to a parquet file, a large amount of memory is consumed. I first discovered this when using `targets::tar_read(obj)` to load an object that had been saved in parquet format. That particular object was an `sf` object with about 20 million rows and 26 columns. For a 5-6 GB object, memory usage ballooned by 22 GB.
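For context, the pipeline target looked roughly like the sketch below. This is only illustrative: the target name and the function that builds the `sf` object are hypothetical, and `format = "parquet"` is the `targets` storage format that reads and writes through `arrow`.

```r
# Hypothetical sketch of the pipeline context (not the actual target definition).
# build_big_sf() stands in for whatever produced the ~20-million-row sf object.
library(targets)

list(
  tar_target(
    obj,
    build_big_sf(),       # hypothetical builder of the large sf object
    format = "parquet"    # targets stores this target via arrow's parquet reader/writer
  )
)
```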
I wrote the following code to test this with a regular `tibble` rather than an `sf` object. In this test, memory increases dramatically when writing but not when reading, which I'm still trying to figure out.
```r
library(arrow)
library(dplyr)
library(lobstr)
library(tictoc)

n <- 10000000

# Build a 10-million-row tibble of mixed numeric and character columns
system('free -m')
tic()
fake <- tibble(
  ID = seq(n),
  x = runif(n = n, min = -170, max = 170),
  y = runif(n = n, min = -60, max = 70),
  text1 = sample(x = state.name, size = n, replace = TRUE),
  text2 = sample(x = state.name, size = n, replace = TRUE),
  text3 = sample(x = state.division, size = n, replace = TRUE),
  text4 = sample(x = state.region, size = n, replace = TRUE),
  text5 = sample(x = state.abb, size = n, replace = TRUE),
  num1 = sample(x = state.center$x, size = n, replace = TRUE),
  num2 = sample(x = state.center$y, size = n, replace = TRUE),
  num3 = sample(x = state.area, size = n, replace = TRUE),
  Rand1 = rnorm(n = n),
  Rand2 = rnorm(n = n, mean = 100, sd = 3),
  Rand3 = rbinom(n = n, size = 10, prob = 0.4)
)
toc()
system('free -m')

obj_size(fake) / 1024 / 1024 / 1024

# Write to parquet: memory balloons here
system('free -m')
tic()
write_parquet(fake, 'data/write_fake.parquet')
toc()
system('free -m')

system('free -m')
gc()
system('free -m')

# Read the file back: memory stays roughly flat
system('free -m')
tic()
fake_parquet <- read_parquet('data/write_fake.parquet')
toc()
system('free -m')
obj_size(fake_parquet) / 1024 / 1024 / 1024
```
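One possible way to narrow this down further, assuming the `ps` package is available, is to sample the process RSS from within R immediately before and after the write, so that any growth is attributed to `write_parquet()` alone rather than to building the tibble:

```r
# Hedged sketch: measure resident set size (RSS) around write_parquet() only.
# Assumes the 'ps' package is installed; 'fake' is the tibble built above.
library(ps)
library(arrow)

rss_mb <- function() ps::ps_memory_info()[["rss"]] / 1024^2

gc()                      # settle R's own heap first
before <- rss_mb()
write_parquet(fake, 'data/write_fake.parquet')
after <- rss_mb()
after - before            # RSS growth (MB) attributable to the write step
```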