Apache Arrow / ARROW-11456

[Python] Parquet reader cannot read large strings


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0, 3.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None
    • Environment:
      pyarrow 3.0.0 / 2.0.0
      pandas 1.1.5 / 1.2.1
      smart_open 4.1.2
      python 3.8.6

    Description

      When reading or writing a large parquet file, I get this error:

          df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
          return impl.read(
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
          return self.api.parquet.read_table(
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
          return dataset.read(columns=columns, use_threads=use_threads,
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
          return self.reader.read_all(column_indices=column_indices,
        File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
      

      Isn't pyarrow supposed to support large parquet files? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why Arrow is effectively limited to 31 bits here; it isn't even a full 32 bits, given that sizes are non-negative.
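
      My understanding is that this limit comes from Arrow's default string type, which stores offsets as signed 32-bit integers, so a single array or builder tops out just under 2**31 elements (the 2147483646 in the error). The large_string type uses 64-bit offsets instead. A minimal sketch of the difference (illustrative values only, not my actual data):

      import pyarrow as pa
      
      # Default string type: int32 offsets, so one contiguous array/builder
      # cannot exceed roughly 2**31 - 1 elements or value bytes.
      small = pa.array(["abc", "def"], type=pa.string())
      
      # large_string uses int64 offsets and does not have the 2**31 cap.
      large = pa.array(["abc", "def"], type=pa.large_string())
      
      print(small.type, large.type)  # string large_string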

      This problem started after I added a string column with 2.5 billion unique values. Each value is effectively a unique 24-character base64-encoded string. Below is code to reproduce the issue:

      from base64 import urlsafe_b64encode
      
      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import smart_open
      
      def num_to_b64(num: int) -> str:
          # Encode an integer as a 24-character URL-safe base64 string.
          return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
      
      # 2.5 billion unique strings in a single "string"-dtype column "s".
      df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
      
      with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
          df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
      

      The dataframe is created correctly. When it is written as a parquet file, the last line of the above code fails with:

      pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2500000000
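
      A possible workaround on the write side, assuming the column can be stored with 64-bit offsets, would be to build the Arrow table with an explicit large_string schema and write it with pyarrow directly instead of DataFrame.to_parquet (a sketch only; the local file path is illustrative):

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      # Convert from pandas with an explicit schema so column "s" becomes
      # large_string (int64 offsets) rather than the default string type,
      # then write the table with pyarrow itself.
      schema = pa.schema([("s", pa.large_string())])
      table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
      pq.write_table(table, "mydata.parquet", compression="gzip")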
      


People

    • Assignee: Unassigned
    • Reporter: Pac A. He (apacman)
    • Votes: 0
    • Watchers: 5
