Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-14020

[R] Writing datafames with list columns is slow and scales poorly with nesting level

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 5.0.0
    • 6.0.0
    • R
    • Windows 10 x64

    Description

      Writing data frames that contain list columns seems much slower than expected:

      ``` r
      library(tidyverse)
      #> Warning: package 'tidyverse' was built under R version 4.1.1
      #> Warning: package 'tibble' was built under R version 4.1.1
      #> Warning: package 'readr' was built under R version 4.1.1
      library(arrow)
      #> Warning: package 'arrow' was built under R version 4.1.1
      #>
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #>
      #> timestamp
      dummy <- tibble(
      points = rep(list(seq(6)), 2e6),
      index = seq(2e6)
      )

      1. very slooooooow
        system.time(write_parquet(dummy, "dummy.parquet"))
        #> user system elapsed
        #> 55.64 0.11 55.98

      dummy_txt <- mutate(dummy, points = map_chr(points, deparse))

      1. orders of magnitude faster
        system.time(write_parquet(dummy_txt, "dummytext.parquet"))
        #> user system elapsed
        #> 0.24 0.02 0.25
        ```

      <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>

      <details style="margin-bottom:10px;">

      <summary>Session info</summary>

      ``` r
      sessioninfo::session_info()
      #> - Session info ---------------------------------------------------------------
      #> setting value
      #> version R version 4.1.0 (2021-05-18)
      #> os Windows 10 x64
      #> system x86_64, mingw32
      #> ui RTerm
      #> language (EN)
      #> collate English_Australia.1252
      #> ctype English_Australia.1252
      #> tz Australia/Brisbane
      #> date 2021-09-17
      #>
      #> - Packages -------------------------------------------------------------------
      #> package * version date lib source
      #> arrow * 5.0.0.2 2021-09-05 [1] CRAN (R 4.1.1)
      #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
      #> backports 1.2.1 2020-12-09 [1] CRAN (R 4.1.0)
      #> bit 4.0.4 2020-08-04 [1] CRAN (R 4.1.0)
      #> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.1.0)
      #> broom 0.7.7 2021-06-13 [1] CRAN (R 4.1.0)
      #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.1.0)
      #> cli 3.0.1 2021-07-17 [1] CRAN (R 4.1.0)
      #> colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.0)
      #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.1.0)
      #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.1.0)
      #> dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.1.0)
      #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
      #> dplyr * 1.0.7 2021-06-18 [1] CRAN (R 4.1.0)
      #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
      #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
      #> fansi 0.5.0 2021-05-25 [1] CRAN (R 4.1.0)
      #> forcats * 0.5.1 2021-01-27 [1] CRAN (R 4.1.0)
      #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
      #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.1.0)
      #> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
      #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
      #> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
      #> haven 2.4.1 2021-04-23 [1] CRAN (R 4.1.0)
      #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
      #> hms 1.1.0 2021-05-17 [1] CRAN (R 4.1.0)
      #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
      #> httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
      #> jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.1.0)
      #> knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
      #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
      #> lubridate 1.7.10 2021-02-26 [1] CRAN (R 4.1.0)
      #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
      #> modelr 0.1.8 2020-05-19 [1] CRAN (R 4.1.0)
      #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
      #> pillar 1.6.2 2021-07-29 [1] CRAN (R 4.1.0)
      #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
      #> purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
      #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
      #> Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.1.0)
      #> readr * 2.0.1 2021-08-10 [1] CRAN (R 4.1.1)
      #> readxl 1.3.1 2019-03-13 [1] CRAN (R 4.1.0)
      #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
      #> rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
      #> rmarkdown 2.9 2021-06-15 [1] CRAN (R 4.1.0)
      #> rvest 1.0.1 2021-07-26 [1] CRAN (R 4.1.0)
      #> scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
      #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
      #> stringi 1.7.4 2021-08-25 [1] CRAN (R 4.1.1)
      #> stringr * 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
      #> styler 1.4.1 2021-03-30 [1] CRAN (R 4.1.0)
      #> tibble * 3.1.4 2021-08-25 [1] CRAN (R 4.1.1)
      #> tidyr * 1.1.3 2021-03-03 [1] CRAN (R 4.1.0)
      #> tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.1.0)
      #> tidyverse * 1.3.1 2021-04-15 [1] CRAN (R 4.1.1)
      #> tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.1.0)
      #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0)
      #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
      #> withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
      #> xfun 0.24 2021-06-15 [1] CRAN (R 4.1.0)
      #> xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
      #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
      #>
      #> [1] C:/Users/msmcbain/libs/R
      #> [2] C:/R/R-4.1.0/library
      ```

      </details>

      In this case it's actually faster to convert the list columns to text and do the write, than to write with the list columns. 

      This issue also affects write_arrow:

      ``` r
      library(tidyverse)
      #> Warning: package 'tidyverse' was built under R version 4.1.1
      #> Warning: package 'tibble' was built under R version 4.1.1
      #> Warning: package 'readr' was built under R version 4.1.1
      library(arrow)
      #> Warning: package 'arrow' was built under R version 4.1.1
      #>
      #> Attaching package: 'arrow'
      #> The following object is masked from 'package:utils':
      #>
      #> timestamp
      dummy <- tibble(
      points = rep(list(seq(6)), 2e6),
      index = seq(2e6)
      )

      1. very slooooooow
        system.time(write_arrow(dummy, "dummy.parquet"))
        #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
        #> user system elapsed
        #> 56.95 0.08 57.13

      dummy_txt <- mutate(dummy, points = map_chr(points, deparse))

      1. orders of magnitude faster
        system.time(write_arrow(dummy_txt, "dummytext.parquet"))
        #> Warning: Use 'write_ipc_stream' or 'write_feather' instead.
        #> user system elapsed
        #> 0.06 0.01 0.10
        ```

      <sup>Created on 2021-09-17 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>

      Interestingly the performance seems to degrade exponentially with the nesting level of the lists:

      ```r

      # add a level of nesting
      dummy2 <- tibble(
        points = rep(list(list(seq(6))), 2e6),
        index = seq(2e6)
      )

      # order of magnitude slower again, lost patience wating for it to return
      system.time(write_parquet(dummy2, "dummy2.parquet")
      ```

      This has implications for {sf} dataframes which use list columns to represent spatial data structures. Arrow/parquet are pretty much not viable for moderate to large spatial data in R:

      ```r

      1. options(timeout = 1000)

      remotes::install_github("wfmackey/absmapsdata")
      library(absmapsdata)

      1. doesn't return in a resonable amount of time
        write_arrow(absmapsdata::sa12016, "sa1.parquet")
      1. can use the same work around as above by converting geomtry to vector of well knowntext, but it takes time and bloats the files

       ```

      Possibly related to https://issues.apache.org/jira/browse/ARROW-12529 ?

       

      Attachments

        Issue Links

          Activity

            People

              jonkeane Jonathan Keane
              milesmcbain Miles McBain
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 4h
                  4h