Apache Arrow / ARROW-10080

[R] Arrow does not release unused memory


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 1.0.1
    • Fix Version: 3.0.0
    • Component: R
    • Environment: Linux, Windows

    Description

      I’m having problems when collect()-ing Arrow data sources into data frames that are close in size to the available memory on the machine. Consider the following workflow: I have a dataset which I want to query, so at some point it needs to be collect()-ed, but at the same time I’m also reducing the result. During the intermediate step the entire data frame fits into memory, and the following code runs without any problems.

      library(arrow)
      library(dplyr)

      test_ds <- "memory_test"

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()
      

      However, running the same code in the same R session again fails with R running out of memory.

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()
      

      The example might be a bit contrived, but you can easily imagine a workflow where different queries are run on a dataset and the reduced results are stored.

      As far as I understand, R is a garbage collected language, and in this case there aren’t any references left to large objects in memory. And indeed, the second query succeeds when manually forcing a garbage collection.
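
      Concretely, the workaround is to interleave a manual gc() between the two queries. Continuing the session above:

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()

      # The large data frame materialised by collect() is only an intermediate
      # value; once dim() returns it is unreferenced, so a manual gc() lets its
      # memory be released before the next query allocates again
      gc()

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()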

      Is this the expected behaviour from Arrow?

      I know this is quite hard to reproduce, as the exact dataset size required to trigger this behaviour depends on the particular machine, but I prepared a reproducible example in this gist, which should give the same result on Ubuntu 20.04 with 1 GB RAM and no swap. See attachment for sessionInfo() output. I ran it on a DigitalOcean s-1vcpu-1gb droplet.

      First, let’s create a partitioned Arrow dataset:

      $ Rscript ds_prep.R 1000000 5
      

      The first command line argument gives the number of rows in each partition, and the second gives the number of partitions. The parameters are set so that the entire dataset should fit into memory.
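
      (The gist is not reproduced here, but ds_prep.R amounts to roughly the following sketch; the column contents and the Hive-style part=<i>/ directory layout are illustrative rather than copied verbatim.)

      # Sketch of ds_prep.R: write n_part Parquet files of n_rows random rows
      # each under memory_test/part=<i>/, so that open_dataset() discovers
      # them as a single partitioned dataset
      library(arrow)

      args   <- as.integer(commandArgs(trailingOnly = TRUE))
      n_rows <- args[1]
      n_part <- args[2]

      for (i in seq_len(n_part)) {
        dir <- file.path("memory_test", sprintf("part=%d", i))
        dir.create(dir, recursive = TRUE, showWarnings = FALSE)
        write_parquet(
          data.frame(x = rnorm(n_rows), y = rnorm(n_rows)),
          file.path(dir, "data.parquet")
        )
      }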

      Then running the two queries fails:

      $ Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      [1]    11151 killed     Rscript ds_read.R
      

      However, when forcing a gc() (which I’m controlling here with a command line argument), it succeeds:

      $ Rscript ds_read.R 1
      Running query, 1st try...
      ds size, 1st run: 56
      running gc() ...
                used (Mb) gc trigger  (Mb) max used  (Mb)
      Ncells  703052 37.6    1571691  84.0  1038494  55.5
      Vcells 1179578  9.0   36405636 277.8 41188956 314.3
      Running query, 2nd try...
      ds size, 2nd run: 56
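
      For reference, ds_read.R is essentially the query from the top of this report run twice, with the gc() made conditional on a command line argument. This is a sketch rather than the exact gist contents (the printed "ds size" is just object.size() of the dim() result):

      # Sketch of ds_read.R: run the same collect() query twice, forcing a
      # garbage collection in between only when an argument is supplied,
      # e.g. `Rscript ds_read.R 1`
      library(arrow)
      library(dplyr)

      run_query <- function() {
        res <- open_dataset("memory_test") %>%
          collect() %>%
          dim()
        as.numeric(object.size(res))   # the "ds size" reported above
      }

      cat("Running query, 1st try...\n")
      cat("ds size, 1st run:", run_query(), "\n")

      if (length(commandArgs(trailingOnly = TRUE)) > 0) {
        cat("running gc() ...\n")
        print(gc())
      }

      cat("Running query, 2nd try...\n")
      cat("ds size, 2nd run:", run_query(), "\n")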
      

      In general, one shouldn’t have to call gc() manually. Interestingly, making R’s garbage collection more aggressive (see ?Memory) doesn’t help either:

      $ R_GC_MEM_GROW=0 Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      [1]    11422 killed     Rscript ds_read.R
      

      I didn’t try to reproduce this problem on macOS, as my Mac would probably start swapping furiously, but I managed to reproduce it on a Windows 7 machine with practically no swap. Of course the parameters are different, and the error messages are presumably system-specific.

      $ Rscript ds_prep.R 1000000 40
      $ Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      Error in dataset___Scanner__ToTable(self) :
        IOError: Out of memory: malloc of size 524288 failed
      Calls: collect ... shared_ptr -> shared_ptr_is_null -> dataset___Scanner__ToTable
      Execution halted
      $ Rscript ds_read.R 1
      Running query, 1st try...
      ds size, 1st run: 56
      running gc() ...
                used (Mb) gc trigger   (Mb)  max used (Mb)
      Ncells  688789 36.8    1198030   64.0   1198030   64
      Vcells 1109451  8.5  271538343 2071.7 321118845 2450
      Running query, 2nd try...
      ds size, 2nd run: 56
      $ R_GC_MEM_GROW=0 Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      Error in dataset___Scanner__ToTable(self) :
        IOError: Out of memory: malloc of size 524288 failed
      Calls: collect ... shared_ptr -> shared_ptr_is_null -> dataset___Scanner__ToTable
      Execution halted
      

      Attachments

        1. sessioninfo.txt (1 kB, András Svraka)

            People

              Assignee: Ben Kietzman (bkietz)
              Reporter: András Svraka (svraka)
