Details
- Type: Bug
- Status: Closed
- Priority: Major
- Resolution: Duplicate
- Affects Version: 0.11.1
- Fix Version: None
- Component: None
- Environment:
  - OS: OSX High Sierra 10.13.6
  - Python: 3.7.0
  - PyArrow: 0.11.1
  - Pandas: 0.23.4
Description
I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's to_parquet method. However, reading that same file back results in an exception. The DataFrame consists of about 32 million rows with seven columns; four are ASCII text and three are booleans.
>>> source_df.shape
(32070402, 7)
>>> source_df.dtypes
Url Source            object
Url Destination       object
Anchor text           object
Follow / No-Follow    object
Link No-Follow          bool
Meta No-Follow          bool
Robot No-Follow         bool
dtype: object
>>> source_df.to_parquet('export.parq', compression='gzip', use_deprecated_int96_timestamps=True)
>>> loaded_df = pd.read_parquet('export.parq')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
    return impl.read(path, columns=columns, **kwargs)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
    **kwargs).to_pandas()
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
    use_pandas_metadata=use_pandas_metadata)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
    table = reader.read(**options)
  File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
    use_threads=use_threads)
  File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
One would expect that a file PyArrow can write successfully, it can also read back. Fortunately, the fastparquet library reads this file without issue, so no data was lost, but the failed round trip was a surprise.