[ARROW-6985] [Python] Steadily increasing time to load file using read_parquet - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Cannot Reproduce
Affects Version/s: 0.13.0, 0.14.0, 0.15.0
Fix Version/s: None
Component/s: Python
Labels: None

External issue URL:
https://github.com/apache/arrow/issues/23299

Description

I've noticed that reading a Parquet file with the pandas read_parquet function takes steadily longer with each invocation. I've seen the other ticket about memory usage, but I see no memory impact here, just a steadily increasing read time until I restart the Python session.

Below is some code to reproduce my results. The slowdown is particularly bad on wide matrices, especially with pyarrow==0.15.0.

import os
import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

file = "skinny_matrix.pq"

# Write the test file once: a mostly-zero matrix with every 100th value random.
if not os.path.isfile(file):
    mat = np.zeros((6000, 26000))  # about 1.2 GB of float64
    mat.ravel()[::100] = np.random.randn(60 * 26000)
    df = pd.DataFrame(mat.T)  # 26000 rows x 6000 columns
    table = pa.Table.from_pandas(df)
    pq.write_table(table, file)

# Read the same file repeatedly and record the wall-clock time of each read.
n_timings = 50
timings = np.empty(n_timings)
for i in range(n_timings):
    start = time.perf_counter()
    new_df = pd.read_parquet(file)
    end = time.perf_counter()
    timings[i] = end - start
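
To put a number on "steadily longer", the timings array from the loop above could be summarized with a fitted slope and a late-versus-early ratio. This is only a sketch: the helper name timing_trend is hypothetical (it is not part of pandas or pyarrow), and the example values are synthetic stand-ins, not measurements from this report.

```python
import numpy as np


def timing_trend(timings):
    """Summarize how per-call read times change over successive calls.

    Returns (slope, ratio):
      slope: least-squares growth in seconds per call
      ratio: mean of the last quarter of calls / mean of the first quarter
    """
    t = np.asarray(timings, dtype=float)
    calls = np.arange(t.size)
    # Degree-1 fit: the first coefficient is the slope.
    slope, _intercept = np.polyfit(calls, t, 1)
    quarter = max(1, t.size // 4)
    ratio = t[-quarter:].mean() / t[:quarter].mean()
    return slope, ratio


# Synthetic stand-in data (an assumption for illustration only):
# 50 reads that start at 0.5 s and grow by 0.02 s per call.
example = 0.5 + 0.02 * np.arange(50)
slope, ratio = timing_trend(example)
print(f"slope: {slope:.4f} s/call, last/first quarter ratio: {ratio:.2f}")
```

A slope near zero and a ratio near 1 would suggest no drift; a clearly positive slope over repeated runs in the same session would match the behavior described here.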

Attachments

image-2019-10-25-14-54-32-583.png (25/Oct/19 13:54, 35 kB, Casey)
image-2019-10-25-14-53-37-623.png (25/Oct/19 13:53, 33 kB, Casey)
image-2019-10-25-14-52-46-165.png (25/Oct/19 13:52, 30 kB, Casey)

People

Assignee: Unassigned

Reporter: Casey

Votes: 0

Watchers: 5

Dates

Created: 24/Oct/19 14:25

Updated: 11/Jan/23 07:50

Resolved: 18/Apr/20 23:27