Apache Arrow / ARROW-11456

[Python] Parquet reader cannot read large strings


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0, 3.0.0
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None
    • Environment:
      pyarrow 3.0.0 / 2.0.0
      pandas 1.1.5 / 1.2.1
      smart_open 4.1.2
      python 3.8.6

    Description

      When reading or writing a large parquet file, I get this error:

          df: Final = pd.read_parquet(input_file_uri, engine="pyarrow")
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
          return impl.read(
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
          return self.api.parquet.read_table(
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 1638, in read_table
          return dataset.read(columns=columns, use_threads=use_threads,
        File "/opt/conda/envs/condaenv/lib/python3.8/site-packages/pyarrow/parquet.py", line 327, in read
          return self.reader.read_all(column_indices=column_indices,
        File "pyarrow/_parquet.pyx", line 1126, in pyarrow._parquet.ParquetReader.read_all
        File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
      OSError: Capacity error: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2147483648
      

      Isn't pyarrow supposed to support large parquet files? It let me write this parquet file, but now it doesn't let me read it back. I don't understand why Arrow is effectively limited to 31 bits here; it isn't even a full 32 bits, given that sizes are non-negative.
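
      My understanding is that this limit comes from Arrow's default string type, which stores offsets as signed 32-bit integers, so a single array or builder tops out just under 2**31 elements (the 2147483646 in the error). The large_string type uses 64-bit offsets instead. A minimal sketch of the difference (illustrative values only, not my actual data):

      import pyarrow as pa
      
      # Default string type: int32 offsets, so one contiguous array/builder
      # cannot exceed roughly 2**31 - 1 elements or value bytes.
      small = pa.array(["abc", "def"], type=pa.string())
      
      # large_string uses int64 offsets and does not have the 2**31 cap.
      large = pa.array(["abc", "def"], type=pa.large_string())
      
      print(small.type, large.type)  # string large_string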

      This problem started after I added a string column with 2.5 billion unique values. Each value is effectively a unique 24-character base64-encoded string. Below is code to reproduce the issue:

      from base64 import urlsafe_b64encode
      
      import numpy as np
      import pandas as pd
      import pyarrow as pa
      import smart_open
      
      def num_to_b64(num: int) -> str:
          # Encode an integer as a 24-character URL-safe base64 string.
          return urlsafe_b64encode(num.to_bytes(16, "little")).decode()
      
      # 2.5 billion unique strings in a single "string"-dtype column "s".
      df = pd.Series(np.arange(2_500_000_000)).apply(num_to_b64).astype("string").to_frame("s")
      
      with smart_open.open("s3://mybucket/mydata.parquet", "wb") as output_file:
          df.to_parquet(output_file, engine="pyarrow", compression="gzip", index=False)
      

      The dataframe is created correctly. When it is written as a parquet file, the last line of the above code fails with:

      pyarrow.lib.ArrowCapacityError: BinaryBuilder cannot reserve space for more than 2147483646 child elements, got 2500000000
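
      A possible workaround on the write side, assuming the column can be stored with 64-bit offsets, would be to build the Arrow table with an explicit large_string schema and write it with pyarrow directly instead of DataFrame.to_parquet (a sketch only; the local file path is illustrative):

      import pyarrow as pa
      import pyarrow.parquet as pq
      
      # Convert from pandas with an explicit schema so column "s" becomes
      # large_string (int64 offsets) rather than the default string type,
      # then write the table with pyarrow itself.
      schema = pa.schema([("s", pa.large_string())])
      table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
      pq.write_table(table, "mydata.parquet", compression="gzip")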
      


People

    • Assignee: Unassigned
    • Reporter: Pac A. He (apacman)
    • Votes: 0
    • Watchers: 5
