Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Versions: 2.0.0, 3.0.0
Fix Versions: None
Components: None
Environment:
pyarrow 3.0.0 / 2.0.0
pandas 1.1.5 / 1.2.1
smart_open 4.1.2
python 3.8.6
Description
When reading or writing a large parquet file, I get this error:
    df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
    return self.reader.read_all(column_indices=column_indices,
  File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
Isn't pyarrow supposed to support large parquet files? It let me write this parquet file, but now it won't let me read it back. I don't understand why Arrow is effectively limited to 31-bit sizes; it's not even 32-bit, since sizes are non-negative.
This problem started after I added a string column with 2.5 billion unique values. Each value is effectively a unique base64-encoded string of length 24. Below is code to reproduce the issue:
from base64 import urlsafe_b64encode

import numpy as np
import pandas as pd
import pyarrow as pa
import smart_open


def num_to_b64(num: int) -> str:
    return urlsafe_b64encode(num.to_bytes(16, "little")).decode()


df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")

with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
    df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
The dataframe is created correctly, but the last line of the code above, which writes it as a parquet file, fails with this error:
pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2500000000