Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 1.0.0
Description
I recently tried to update pyarrow from 0.17.1 to 1.0.0 and hit a serious bug where wide DataFrames fail during pandas.read_parquet. Small parquet files (m=10000) read correctly, medium files (m=40000) fail with "Bus error: 10", and large files (m=100000) hang entirely. Tested with Python 3.8.5, pandas 1.0.5, pyarrow 1.0.0, on OSX 10.14.
The driver code and output are below:
import pandas as pd
import numpy as np
import sys

filename = "test.parquet"
n = 10                # number of rows
m = int(sys.argv[1])  # number of columns (DataFrame width)
print(m)

# Build an n x m DataFrame of zeros and write it to parquet
x = np.zeros((n, m))
x = pd.DataFrame(x, columns=[f"A_{i}" for i in range(m)])
x.to_parquet(filename)

# Reading it back fails for large m
y = pd.read_parquet(filename, engine='pyarrow')
time python test_pyarrow.py 10000
real 0m4.018s
user 0m5.286s
sys 0m0.514s

time python test_pyarrow.py 40000
40000
Bus error: 10
In a pyarrow 0.17.1 environment, the 40,000-column case completes in 8 seconds.
This was cross-posted on the pandas tracker as well: https://github.com/pandas-dev/pandas/issues/35846