Apache Arrow / ARROW-16452

[R] After dataset scan, some RAM is left consumed until a garbage collection pass

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      This might be "not a bug", but I wonder if we can do something better here. When I create and execute a dplyr query, a substantial amount of RAM remains allocated until the next GC pass.

      Because R's garbage collector is triggered only by memory that R itself has allocated, this extra memory (which can be quite substantial) might never be freed.

      Perhaps we should just manually trigger a GC pass after running an execution plan? Alternatively, it may be good to get a better understanding of what exactly this memory is being used for.

      In the example below I load ~2GB of data, but after the collect ~3GB is in use. I wait 10 seconds to rule out jemalloc holding on to the memory. Then I run gc() manually and ~1GB is freed.
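
      To make the "manually trigger a GC pass" idea concrete, here is a minimal sketch in base R. The helper name with_gc_after is hypothetical and is not part of arrow or dplyr; it only illustrates forcing a collection right after a query has been evaluated, and it does not explain what the leftover memory is used for.

```r
# Hypothetical helper (an assumption for illustration, not an arrow API):
# evaluate an expression, force one R garbage collection pass, then
# return the value of the expression.
with_gc_after <- function(expr) {
  result <- force(expr)  # evaluate the lazily passed expression now
  invisible(gc())        # run a GC pass so unreachable R objects are finalized
  result
}

# Intended usage (assumes arrow and dplyr are loaded and `dataset` exists):
# x <- with_gc_after(dataset %>% collect(as_data_frame = FALSE))
```

      Whether an unconditional gc() is acceptable is a design question, since a full collection has a cost of its own on large R heaps.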

      > library(arrow)
      > library(dplyr)  # provides %>% and collect()
      > dataset = arrow::open_dataset('/home/pace/dev/data/dataset/parquet/5')
      > default_memory_pool()$bytes_allocated
      [1] 64
      > x <- dataset %>% collect(as_data_frame=FALSE)
      > arrow::default_memory_pool()$bytes_allocated
      [1] 2921135104
      > Sys.sleep(10)
      > arrow::default_memory_pool()$bytes_allocated
      [1] 2921135104
      > gc()
                used (Mb) gc trigger (Mb) max used (Mb)
      Ncells  917099 49.0    1498168 80.1  1498168 80.1
      Vcells 1649894 12.6    8388608 64.0  2617403 20.0
      > arrow::default_memory_pool()$bytes_allocated
      [1] 2028716480
      
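
      As a quick check of the "~1GB is freed" estimate, the difference between the two bytes_allocated readings can be computed directly. The variable names are only illustrative; the two numbers are copied from the session above.

```r
# Readings of default_memory_pool()$bytes_allocated from the session above
after_collect <- 2921135104  # after collect(), before gc()
after_gc      <- 2028716480  # after the manual gc()

# Memory released by the manual GC pass
freed_bytes <- after_collect - after_gc
freed_gib   <- freed_bytes / 1024^3
freed_share <- freed_bytes / after_collect

cat(sprintf("freed: %.0f bytes (%.2f GiB, %.1f%% of the post-collect total)\n",
            freed_bytes, freed_gib, 100 * freed_share))
```

      The freed amount is 892418624 bytes, a little under 1 GB, which matches the rough figure given in the description.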

            People

              Assignee: Unassigned
              Reporter: Weston Pace (westonpace)
              Votes: 0
              Watchers: 4
