Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Cannot Reproduce
Affects Version/s: 0.13.0, 0.14.0, 0.15.0
None
None
Description
I've noticed that reading a Parquet file with the pandas read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I'm seeing no memory impact, just a steadily increasing read time until I restart the Python session.
Below is some code to reproduce my results. It is particularly bad on wide matrices, especially with pyarrow==0.15.0.
import pyarrow.parquet as pq
import pyarrow as pa
import pandas as pd
import os
import numpy as np
import time

file = "skinny_matrix.pq"
if not os.path.isfile(file):
    mat = np.zeros((6000, 26000))
    mat.ravel()[::100] = np.random.randn(60 * 26000)
    df = pd.DataFrame(mat.T)
    table = pa.Table.from_pandas(df)
    pq.write_table(table, file)

n_timings = 50
timings = np.empty(n_timings)
for i in range(n_timings):
    start = time.time()
    new_df = pd.read_parquet(file)
    end = time.time()
    timings[i] = end - start
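To put a number on "steadily longer", one option is to fit a straight line to the per-call timings and look at its slope. The sketch below is not part of the original report: time_calls and timing_slope are hypothetical helper names, and it uses only the standard library so it can run without pyarrow installed. A slope near zero would suggest no drift, while a clearly positive slope would match the behaviour described above.

```python
import time


def time_calls(fn, n_calls):
    """Call fn() n_calls times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def timing_slope(durations):
    """Least-squares slope of duration against call index (seconds per call)."""
    n = len(durations)
    if n < 2:
        raise ValueError("need at least two timings")
    mean_x = (n - 1) / 2
    mean_y = sum(durations) / n
    # Covariance of (index, duration) divided by the variance of the index.
    cov = sum((i - mean_x) * (d - mean_y) for i, d in enumerate(durations))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var
```

With the reproduction script above, this could be used as timing_slope(time_calls(lambda: pd.read_parquet(file), 50)), or applied directly to the existing timings array via timing_slope(list(timings)).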