Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Fix Version: 1.0.0
Description
I've found that reading a parquet file one column at a time is slow in R – much slower than reading the whole file at once in R, or than reading one column at a time in Python.
An example is below, though it's certainly possible I've done my benchmarking incorrectly.
Python setup and benchmarking:
import numpy as np
import pyarrow
import pyarrow.parquet as pq
from numpy.random import default_rng
from time import time

# Create a large, random array to save. ~1.5 GB.
rng = default_rng(seed=1)
n_col = 4000
n_row = 50000
mat = rng.standard_normal((n_col, n_row))
col_names = [str(nm) for nm in range(n_col)]
tab = pyarrow.Table.from_arrays(mat, names=col_names)
pq.write_table(tab, "test_tab.parquet", use_dictionary=False)

# How long does it take to read the whole thing in Python?
time_start = time()
_ = pq.read_table("test_tab.parquet")  # edit: corrected filename
elapsed = time() - time_start
print(elapsed)  # under 1 second on my computer

# How long does it take to read it one column at a time?
time_start = time()
f = pq.ParquetFile("test_tab.parquet")
for one_col in col_names:
    _ = f.read([one_col]).column(0)
elapsed = time() - time_start
print(elapsed)  # about 2 seconds
R benchmarking, using the same test_tab.parquet file:
library(arrow)

read_by_column <- function(f) {
  table <- ParquetFileReader$create(f)
  cols <- as.character(0:3999)
  purrr::walk(cols, ~ table$ReadTable(.)$column(0))
}

bench::mark(
  read_parquet("test_tab.parquet", as_data_frame = FALSE),  # 0.6 s
  read_parquet("test_tab.parquet", as_data_frame = TRUE),   # 1 s
  read_by_column("test_tab.parquet"),                       # 100 s
  check = FALSE
)
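As an aside, the same per-column access pattern can also be expressed through read_parquet()'s col_select argument rather than a ParquetFileReader. The sketch below is illustrative only: read_by_col_select is a made-up helper name, it was not part of the timed benchmark above, and I haven't verified whether it sidesteps the slowdown.

# Illustrative alternative (not benchmarked here): read one column per
# call using col_select instead of ParquetFileReader$ReadTable().
read_by_col_select <- function(f, cols) {
  for (nm in cols) {
    # col_select takes a tidyselect specification; all_of(nm) selects
    # exactly the column whose name is stored in the string nm.
    read_parquet(f, col_select = tidyselect::all_of(nm), as_data_frame = FALSE)
  }
}

read_by_col_select("test_tab.parquet", as.character(0:3999))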
Attachments