Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Description
This might be "not a bug" but I wonder if we can do something better here. When I create and execute a dplyr query, a significant amount of RAM is left allocated until the next GC pass.
Since R's garbage collector is triggered only by memory that R itself has allocated, this extra memory (which can be quite substantial) might never be freed.
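To make the mismatch concrete, here is a minimal sketch (my illustration, not part of the original report) that puts the two accounting systems side by side after running the query from the transcript below:

library(arrow)
library(dplyr)

dataset <- open_dataset('/home/pace/dev/data/dataset/parquet/5')
x <- dataset %>% collect(as_data_frame = FALSE)

# Memory R's collector accounts for: the "(Mb)" used column (column 2)
# of gc()'s summary, covering Ncells and Vcells only.
r_managed_mb <- sum(gc()[, 2])

# Memory held by Arrow's default pool, which R's GC trigger never sees.
arrow_mb <- default_memory_pool()$bytes_allocated / 2^20

cat("R-managed:", round(r_managed_mb), "Mb; Arrow pool:", round(arrow_mb), "Mb\n")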
Perhaps we should manually trigger a GC pass after running an execution plan? Or it may be worth getting a better understanding of exactly what this memory is being used for.
In the example below I load ~2GB of data, but after the collect ~3GB is in use. I wait 10 seconds to rule out jemalloc's delayed release. Then I run gc() manually and ~1GB is freed.
> dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
> default_memory_pool()$bytes_allocated
[1] 64
> x <- dataset %>% collect(as_data_frame=FALSE)
> arrow::default_memory_pool()$bytes_allocated
[1] 2921135104
> Sys.sleep(10)
> arrow::default_memory_pool()$bytes_allocated
[1] 2921135104
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  917099 49.0    1498168 80.1  1498168 80.1
Vcells 1649894 12.6    8388608 64.0  2617403 20.0
> arrow::default_memory_pool()$bytes_allocated
[1] 2028716480
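A sketch of the manual-gc workaround suggested above, under the assumption that an extra pass after each plan is acceptable; collect_then_gc is a hypothetical helper name, not an arrow API:

library(arrow)
library(dplyr)

# Hypothetical helper: collect a query, then force a full R GC pass so
# that Arrow buffers reachable only from now-dead R objects are returned
# to the pool instead of waiting for R's own allocation-based trigger.
collect_then_gc <- function(query, ...) {
  result <- dplyr::collect(query, ...)
  invisible(gc(full = TRUE))
  result
}

x <- open_dataset('/home/pace/dev/data/dataset/parquet/5') %>%
  collect_then_gc(as_data_frame = FALSE)
default_memory_pool()$bytes_allocated  # may now be lower than without the pass

If arrow itself adopted this, the gc() call would presumably live at the end of execution plan teardown rather than in user code.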
Issue Links

- is related to: ARROW-17002 [R] dplyr queries create locks on FileSystemDataset files (Open)