Apache Arrow / ARROW-3999

[Python] Can't read large file that pyarrow wrote


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.11.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
      None
    • Environment:
      OS: OSX High Sierra 10.13.6
      Python: 3.7.0
      PyArrow: 0.11.1
      Pandas: 0.23.4

      Description

      I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's to_parquet method. However, reading that same file back results in an exception. The DataFrame consists of about 32 million rows with seven columns; four are ASCII text and three are booleans.


      >>> source_df.shape
      (32070402, 7)
      
      >>> source_df.dtypes
      Url Source object
      Url Destination object
      Anchor text object
      Follow / No-Follow object
      Link No-Follow bool
      Meta No-Follow bool
      Robot No-Follow bool
      dtype: object
      
      >>> source_df.to_parquet('export.parq', compression='gzip',
                               use_deprecated_int96_timestamps=True)
      
      >>> loaded_df = pd.read_parquet('export.parq')
      Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
         return impl.read(path, columns=columns, **kwargs)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
         **kwargs).to_pandas()
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
         use_pandas_metadata=use_pandas_metadata)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
         use_pandas_metadata=use_pandas_metadata)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
         use_pandas_metadata=use_pandas_metadata)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
         table = reader.read(**options)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
         use_threads=use_threads)
       File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
       File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
      

      One would expect that if PyArrow can write a file successfully, it can read that same file back. Fortunately the fastparquet library reads this file without issue, so no data was lost, but the round-trip failure was a surprise.
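      For context (my reading, not part of the original report): the capacity in the error message matches the maximum a BinaryArray can address with the signed 32-bit value offsets Arrow's binary/string layout uses, which is why a single column holding just over 2 GiB of string data fails on read. A quick arithmetic check against the numbers in the traceback:

      ```python
      # Arrow's BinaryArray stores value offsets as signed 32-bit integers,
      # so one array's value buffer is capped just under 2 GiB.
      INT32_MAX = 2**31 - 1        # 2147483647

      # The cap reported by pyarrow is INT32_MAX - 1.
      capacity = INT32_MAX - 1     # 2147483646

      # Bytes the file's string column actually held, per the traceback.
      have = 2147483685

      print(have - capacity)       # prints 39 -- only 39 bytes over the limit
      ```

      So the column overshoots the 32-bit offset range by a few dozen bytes; later Arrow releases added 64-bit-offset large binary/string types and chunked reads that avoid this cap.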

        People

        • Assignee: Unassigned
        • Reporter: Diego Argueta