Apache Arrow / ARROW-9557

[R] Iterating over parquet columns is slow in R

Details

    Description

      I've found that reading a parquet file one column at a time is slow in R: much slower than reading the whole file at once in R, or reading one column at a time in Python.

      An example is below, though it's certainly possible I've done my benchmarking incorrectly.

       

      Python setup and benchmarking:

      import numpy as np
      import pyarrow
      import pyarrow.parquet as pq
      from numpy.random import default_rng
      from time import time
      
      # Create a large, random array to save. ~1.5 GB.
      rng = default_rng(seed = 1)
      n_col = 4000
      n_row = 50000
      
      mat = rng.standard_normal((n_col, n_row))
      col_names = [str(nm) for nm in range(n_col)]
      tab = pyarrow.Table.from_arrays(mat, names=col_names)
      
      pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
      
      # How long does it take to read the whole thing in python?
      time_start = time()
      _ = pq.read_table("test_tab.parquet") # edit: corrected filename
      elapsed = time() - time_start
      print(elapsed) # under 1 second on my computer
      
      
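      # How long does it take to read one column at a time in python?
      time_start = time()
      f = pq.ParquetFile("test_tab.parquet")
      for one_col in col_names:
          # read() takes a list of column names, so wrap the single name
          _ = f.read(columns=[one_col]).column(0)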
      
      elapsed = time() - time_start
      print(elapsed) # about 2 seconds
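
      Since a single time() measurement can be noisy, a slightly more robust way to take the timings above might look like the sketch below. It is only an illustration: time_call is a hypothetical helper name (not part of pyarrow), and the workload shown is a stand-in so the snippet runs without the parquet file.

```python
from time import perf_counter


def time_call(fn, repeats=3):
    """Call fn() `repeats` times; return (best_seconds, last_result).

    Hypothetical helper for illustration, not part of pyarrow.
    """
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = perf_counter()  # monotonic, high-resolution clock
        result = fn()
        best = min(best, perf_counter() - start)
    return best, result


# Stand-in workload. For the benchmark above, fn could instead be
# something like: lambda: pq.read_table("test_tab.parquet")
best, total = time_call(lambda: sum(range(100_000)))
print(f"best of 3: {best:.4f} s")
```

      Taking the best of a few runs reduces the effect of disk caching and other background noise, though with a 100x gap between the R and Python timings it is unlikely to change the conclusion.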
      
      
      

      R benchmarking, using the same test_tab.parquet file

      library(arrow)
      
      read_by_column <- function(f) {
          # Open the file once, then read each column separately
          reader <- ParquetFileReader$create(f)
          cols <- as.character(0:3999)
          purrr::walk(cols, ~reader$ReadTable(.)$column(0))
      }
      
      bench::mark(
          read_parquet("test_tab.parquet", as_data_frame=FALSE), #   0.6 s
          read_parquet("test_tab.parquet", as_data_frame=TRUE),  #   1 s
          read_by_column("test_tab.parquet"),                    # 100 s
          check=FALSE
      )
      
      

      Attachments

        1. profile_screenshot.png
          107 kB
          Karl Dunkle Werner

            People

              romainfrancois Romain Francois
              karldw Karl Dunkle Werner

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 2h 20m