Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Fix Version: 1.0.0
Description
I've found that reading a parquet file one column at a time is slow in R – much slower than reading the whole file at once in R, or than reading one column at a time in Python.
An example is below, though it's certainly possible I've done my benchmarking incorrectly.
Python setup and benchmarking:
import numpy as np
import pyarrow
import pyarrow.parquet as pq
from numpy.random import default_rng
from time import time

# Create a large, random array to save. ~1.5 GB.
rng = default_rng(seed=1)
n_col = 4000
n_row = 50000
mat = rng.standard_normal((n_col, n_row))
col_names = [str(nm) for nm in range(n_col)]
tab = pyarrow.Table.from_arrays(mat, names=col_names)
pq.write_table(tab, "test_tab.parquet", use_dictionary=False)

# How long does it take to read the whole thing in Python?
time_start = time()
_ = pq.read_table("test_tab.parquet")  # edit: corrected filename
elapsed = time() - time_start
print(elapsed)  # under 1 second on my computer

# How long does it take to read it one column at a time?
time_start = time()
f = pq.ParquetFile("test_tab.parquet")
for one_col in col_names:
    _ = f.read([one_col]).column(0)
elapsed = time() - time_start
print(elapsed)  # about 2 seconds
R benchmarking, using the same test_tab.parquet file:
library(arrow)

read_by_column <- function(f) {
  table <- ParquetFileReader$create(f)
  cols <- as.character(0:3999)
  purrr::walk(cols, ~ table$ReadTable(.)$column(0))
}

bench::mark(
  read_parquet("test_tab.parquet", as_data_frame = FALSE),  # 0.6 s
  read_parquet("test_tab.parquet", as_data_frame = TRUE),   # 1 s
  read_by_column("test_tab.parquet"),                       # 100 s
  check = FALSE
)
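As an aside, the same per-column access pattern can also be expressed through read_parquet()'s col_select argument rather than a ParquetFileReader. The sketch below is illustrative only: read_by_col_select is a made-up helper name, it was not part of the timed benchmark above, and I haven't verified whether it sidesteps the slowdown.

# Illustrative alternative (not benchmarked here): read one column per
# call using col_select instead of ParquetFileReader$ReadTable().
read_by_col_select <- function(f, cols) {
  for (nm in cols) {
    # col_select takes a tidyselect specification; all_of(nm) selects
    # exactly the column whose name is stored in the string nm.
    read_parquet(f, col_select = tidyselect::all_of(nm), as_data_frame = FALSE)
  }
}

read_by_col_select("test_tab.parquet", as.character(0:3999))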
Attachments