Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version: 0.17.0
Environment: Linux OS with RHEL 7.7 distribution
Description
pq.read_schema() fails when reading the schema of a wide Parquet file created from a Pandas DataFrame with 50,000 columns. The same code works with pyarrow 0.16.0.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)
df = pd.DataFrame({'c' + str(i): np.random.randn(10) for i in range(50000)})
table = pa.Table.from_pandas(df)
pq.write_table(table, "test_wide.parquet")
schema = pq.read_schema('test_wide.parquet')
Output:
0.17.0
Traceback (most recent call last):
File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3319, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-29-d5ef2df77263>", line 9, in <module>
table = pq.read_schema('test_wide.parquet')
File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 1793, in read_schema
return ParquetFile(where, memory_map=memory_map).schema.to_arrow_schema()
File "/GAAL/kisseri/conda_envs/blkmamba-dev/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in _init_
read_dictionary=read_dictionary, metadata=metadata)
File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: TProtocolException: Exceeded size limit
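
For context, a minimal narrowing sketch (not part of the original report; the file names and column counts are illustrative) that writes progressively wider files and reports where pq.read_schema() starts hitting the thrift size limit in the affected environment:

# Narrowing sketch (illustrative): find the column count at which
# pq.read_schema() begins to fail with the thrift size-limit error.
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

for ncols in (1000, 5000, 10000, 25000, 50000):
    df = pd.DataFrame({'c' + str(i): np.random.randn(10) for i in range(ncols)})
    path = 'test_wide_%d.parquet' % ncols
    pq.write_table(pa.Table.from_pandas(df), path)
    try:
        pq.read_schema(path)
        print(ncols, 'columns: ok')
    except OSError as exc:  # "Couldn't deserialize thrift: ... Exceeded size limit"
        print(ncols, 'columns: failed:', exc)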