  1. Apache Arrow
  2. ARROW-6230

[R] Reading Parquet files is 20x slower than reading fst files in R


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.14.0
    • Fix Version/s: 0.15.0
    • Component/s: R
    • Environment: Windows 10 Pro and Ubuntu

    Description

      Problem

      Reading any of the datasets listed below from Parquet with arrow::read_parquet is roughly 20x slower than reading the same data from the fst format in R.

       

      How to get the data

      https://loanperformancedata.fanniemae.com/lppub/index.html

      Register and download any of the files offered there. I can't redistribute the data, so it's best that you register yourself.

       

       

      Code

      ```r
      path = "data/Performance_2016Q4.txt"

      library(data.table)
      library(arrow)

      a = data.table::fread(path, header = FALSE)

      fst::write_fst(a, "data/a.fst")

      arrow::write_parquet(a, "data/a.parquet")

      rm(a); gc()

      #read in test
      system.time(a <- fst::read_fst("data/a.fst")) # 4.61 seconds

      rm(a); gc()

      # read in test
      system.time(a <- arrow::read_parquet("data/a.parquet")) # 99.19 seconds
      ```
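
      Because the Fannie Mae files require registration, the same comparison can be approximated on synthetic data. This is only a minimal sketch, not the benchmark reported above: the data frame `a`, its column names, and the helper `time_read` are hypothetical, and timings on synthetic data may not show the same gap as the real performance file.

      ```r
      # Synthetic stand-in for the performance file (assumption: a mix of
      # integer, numeric, and string columns; the real schema differs).
      set.seed(1)
      n <- 1e5
      a <- data.frame(
        loan_id = sample.int(1e4, n, replace = TRUE),
        rate    = round(runif(n, 2, 8), 3),
        status  = sample(c("C", "1", "2", "P"), n, replace = TRUE),
        stringsAsFactors = FALSE
      )

      # Hypothetical helper: run `reader(path)` `reps` times and return the
      # median elapsed wall-clock time in seconds.
      time_read <- function(reader, path, reps = 3) {
        elapsed <- vapply(
          seq_len(reps),
          function(i) system.time(reader(path))[["elapsed"]],
          numeric(1)
        )
        median(elapsed)
      }

      # Each format is only timed if its package is installed.
      results <- numeric(0)
      if (requireNamespace("fst", quietly = TRUE)) {
        fst_path <- tempfile(fileext = ".fst")
        fst::write_fst(a, fst_path)
        results["fst"] <- time_read(fst::read_fst, fst_path)
      }
      if (requireNamespace("arrow", quietly = TRUE)) {
        parquet_path <- tempfile(fileext = ".parquet")
        arrow::write_parquet(a, parquet_path)
        results["parquet"] <- time_read(arrow::read_parquet, parquet_path)
      }
      print(results)
      ```

      Taking the median over a few repetitions reduces the effect of disk caching on any single run; increasing `n` makes the test closer in size to the real file.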

            People

              Assignee: Wes McKinney (wesm)
              Reporter: Zhuo Jia Dai (xiaodai)
