Apache Arrow / ARROW-10080

[R] Arrow does not release unused memory


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version: 1.0.1
    • Fix Version: 3.0.0
    • Component: R
    • Environment: Linux, Windows

    Description

      I’m having problems when collect()-ing Arrow data sources into data frames that are close in size to the available memory on the machine. Consider the following workflow: I have a dataset which I want to query, so at some point it needs to be collect()-ed, but at the same time I’m also reducing the result. During the intermediate step the entire data frame fits into memory, and the following code runs without any problems.

      library(arrow)
      library(dplyr)

      test_ds <- "memory_test"

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()
      

      However, running the same code in the same R session again fails with R running out of memory.

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()
      

      The example might be a bit contrived, but you can easily imagine a workflow where different queries are run on a dataset and the reduced results are stored.

      As far as I understand, R is a garbage collected language, and in this case there aren’t any references left to large objects in memory. And indeed, the second query succeeds when manually forcing a garbage collection.
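
      Concretely, the workaround is to interleave a manual gc() between the two queries. Continuing the session above:

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()

      # The large data frame materialised by collect() is only an intermediate
      # value; once dim() returns it is unreferenced, so a manual gc() lets its
      # memory be released before the next query allocates again
      gc()

      ds1 <- open_dataset(test_ds) %>%
        collect() %>%
        dim()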

      Is this the expected behaviour from Arrow?

      I know this is quite hard to reproduce, as the exact dataset size required to trigger this behaviour depends on the particular machine, but I prepared a reproducible example in this gist, which should give the same result on Ubuntu 20.04 with 1 GB RAM and no swap. See attachment for sessionInfo() output. I ran it on a DigitalOcean s-1vcpu-1gb droplet.

      First, let’s create a partitioned Arrow dataset:

      $ Rscript ds_prep.R 1000000 5
      

      The first command line argument gives the number of rows in each partition, and the second gives the number of partitions. The parameters are set so that the entire dataset should fit into memory.
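
      (The gist is not reproduced here, but ds_prep.R amounts to roughly the following sketch; the column contents and the Hive-style part=<i>/ directory layout are illustrative rather than copied verbatim.)

      # Sketch of ds_prep.R: write n_part Parquet files of n_rows random rows
      # each under memory_test/part=<i>/, so that open_dataset() discovers
      # them as a single partitioned dataset
      library(arrow)

      args   <- as.integer(commandArgs(trailingOnly = TRUE))
      n_rows <- args[1]
      n_part <- args[2]

      for (i in seq_len(n_part)) {
        dir <- file.path("memory_test", sprintf("part=%d", i))
        dir.create(dir, recursive = TRUE, showWarnings = FALSE)
        write_parquet(
          data.frame(x = rnorm(n_rows), y = rnorm(n_rows)),
          file.path(dir, "data.parquet")
        )
      }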

      Then running the two queries fails:

      $ Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      [1]    11151 killed     Rscript ds_read.R
      

      However, when forcing a gc() (which I’m controlling here with a command line argument), it succeeds:

      $ Rscript ds_read.R 1
      Running query, 1st try...
      ds size, 1st run: 56
      running gc() ...
                used (Mb) gc trigger  (Mb) max used  (Mb)
      Ncells  703052 37.6    1571691  84.0  1038494  55.5
      Vcells 1179578  9.0   36405636 277.8 41188956 314.3
      Running query, 2nd try...
      ds size, 2nd run: 56
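
      For reference, ds_read.R is essentially the query from the top of this report run twice, with the gc() made conditional on a command line argument. This is a sketch rather than the exact gist contents (the printed "ds size" is just object.size() of the dim() result):

      # Sketch of ds_read.R: run the same collect() query twice, forcing a
      # garbage collection in between only when an argument is supplied,
      # e.g. `Rscript ds_read.R 1`
      library(arrow)
      library(dplyr)

      run_query <- function() {
        res <- open_dataset("memory_test") %>%
          collect() %>%
          dim()
        as.numeric(object.size(res))   # the "ds size" reported above
      }

      cat("Running query, 1st try...\n")
      cat("ds size, 1st run:", run_query(), "\n")

      if (length(commandArgs(trailingOnly = TRUE)) > 0) {
        cat("running gc() ...\n")
        print(gc())
      }

      cat("Running query, 2nd try...\n")
      cat("ds size, 2nd run:", run_query(), "\n")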
      

      In general, one shouldn’t have to call gc() manually. Interestingly, making R’s garbage collection more aggressive (see ?Memory) doesn’t help either:

      $ R_GC_MEM_GROW=0 Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      [1]    11422 killed     Rscript ds_read.R
      

      I didn’t try to reproduce this problem on macOS, as my Mac would probably start swapping furiously, but I managed to reproduce it on a Windows 7 machine with practically no swap. Of course the parameters are different, and the error messages are presumably system-specific.

      $ Rscript ds_prep.R 1000000 40
      $ Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      Error in dataset___Scanner__ToTable(self) :
        IOError: Out of memory: malloc of size 524288 failed
      Calls: collect ... shared_ptr -> shared_ptr_is_null -> dataset___Scanner__ToTable
      Execution halted
      $ Rscript ds_read.R 1
      Running query, 1st try...
      ds size, 1st run: 56
      running gc() ...
                used (Mb) gc trigger   (Mb)  max used (Mb)
      Ncells  688789 36.8    1198030   64.0   1198030   64
      Vcells 1109451  8.5  271538343 2071.7 321118845 2450
      Running query, 2nd try...
      ds size, 2nd run: 56
      $ R_GC_MEM_GROW=0 Rscript ds_read.R
      Running query, 1st try...
      ds size, 1st run: 56
      Running query, 2nd try...
      Error in dataset___Scanner__ToTable(self) :
        IOError: Out of memory: malloc of size 524288 failed
      Calls: collect ... shared_ptr -> shared_ptr_is_null -> dataset___Scanner__ToTable
      Execution halted
      

      Attachments

        1. sessioninfo.txt (1 kB, András Svraka)

            People

              Assignee: Ben Kietzman (bkietz)
              Reporter: András Svraka (svraka)
