  1. Apache Arrow
  2. ARROW-12529

[R] Writing to Parquet from tibble Consumes Large Amount of Memory


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      Writing a large `tibble` to a Parquet file consumes far more memory than the object itself occupies. I first noticed this when using `targets::tar_read(obj)` to load an object that had been saved in the Parquet format. That particular object was an `sf` object with about 20 million rows and 26 columns. For a 5-6 GB object, memory ballooned by 22 GB.

      I wrote the following code to test this using a regular `tibble`, not `sf`. In this test, memory increases dramatically when writing but not when reading, which I'm still trying to figure out.

      library(arrow)
      library(dplyr)
      library(lobstr)
      library(tictoc)

      n <- 10000000

      # Baseline system memory ('free' is Linux-only)
      system('free -m')

      # Build the test tibble
      tic()
      fake <- tibble(
          ID=seq(n),
          x=runif(n=n, min=-170, max=170),
          y=runif(n=n, min=-60, max=70),
          text1=sample(x=state.name, size=n, replace=TRUE),
          text2=sample(x=state.name, size=n, replace=TRUE),
          text3=sample(x=state.division, size=n, replace=TRUE),
          text4=sample(x=state.region, size=n, replace=TRUE),
          text5=sample(x=state.abb, size=n, replace=TRUE),
          num1=sample(x=state.center$x, size=n, replace=TRUE),
          num2=sample(x=state.center$y, size=n, replace=TRUE),
          num3=sample(x=state.area, size=n, replace=TRUE),
          Rand1=rnorm(n=n),
          Rand2=rnorm(n=n, mean=100, sd=3),
          Rand3=rbinom(n=n, size=10, prob=0.4)
      )
      toc()
      system('free -m')

      # In-memory size of the tibble in GiB
      obj_size(fake)/1024/1024/1024

      # Write to Parquet; this is where memory grows sharply
      dir.create('data', showWarnings=FALSE)  # make sure the output directory exists
      system('free -m')
      tic()
      write_parquet(fake, 'data/write_fake.parquet')
      toc()
      system('free -m')

      # Check whether garbage collection releases the memory
      gc()
      system('free -m')

      # Read the same file back
      tic()
      fake_parquet <- read_parquet('data/write_fake.parquet')
      toc()
      system('free -m')
      obj_size(fake_parquet)/1024/1024/1024
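
The test above relies on `free -m`, which reports memory for the whole machine. A narrower measurement could read the resident set size (RSS) of the R process itself before and after a single call. The following is only a sketch under assumptions: `process_rss_mib()` and `with_rss()` are hypothetical helper names that are not part of arrow or any other package, the `/proc/self/status` lookup works only on Linux, and RSS may stay high after a call because freed memory is not always returned to the operating system.

```r
# Hypothetical helper: resident set size (RSS) of the current R process in MiB.
# Reads /proc/self/status, so it returns NA on platforms without /proc (non-Linux).
process_rss_mib <- function() {
  status_file <- "/proc/self/status"
  if (!file.exists(status_file)) return(NA_real_)
  lines <- readLines(status_file, warn = FALSE)
  rss <- grep("^VmRSS:", lines, value = TRUE)
  if (length(rss) == 0) return(NA_real_)
  # The line looks like "VmRSS:   123456 kB"; keep the digits and convert kB to MiB
  as.numeric(gsub("[^0-9]", "", rss[1])) / 1024
}

# Hypothetical helper: evaluate `expr` once and report elapsed time and RSS change.
with_rss <- function(expr) {
  gc()  # collect R garbage first so the baseline is less noisy
  before <- process_rss_mib()
  elapsed <- system.time(value <- force(expr))[["elapsed"]]
  after <- process_rss_mib()
  list(
    value = value,
    elapsed_s = elapsed,
    rss_before_mib = before,
    rss_after_mib = after,
    rss_delta_mib = after - before
  )
}

# Usage on a small data frame; runs only if arrow is installed
if (requireNamespace("arrow", quietly = TRUE)) {
  small <- data.frame(id = seq_len(1e5), x = runif(1e5))
  path <- tempfile(fileext = ".parquet")
  res <- with_rss(arrow::write_parquet(small, path))
  print(res$rss_delta_mib)
  # Bytes currently held by Arrow's own allocator, which R's gc() does not track
  print(arrow::default_memory_pool()$bytes_allocated)
}
```

Comparing the RSS change with the bytes still held by Arrow's memory pool may help separate memory retained by Arrow's C++ allocator from memory held on the R heap.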
      
      


          People

            Assignee: Unassigned
            Reporter: Jared Lander (jaredlander)