Apache Arrow / ARROW-3999

[Python] Can't read large file that pyarrow wrote


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 0.11.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels:
      None
    • Environment:
      OS: OSX High Sierra 10.13.6
      Python: 3.7.0
      PyArrow: 0.11.1
      Pandas: 0.23.4

      Description

      I loaded a large Pandas DataFrame from a CSV and successfully wrote it to a Parquet file using the DataFrame's to_parquet method. However, reading that same file back results in an exception. The DataFrame consists of about 32 million rows with seven columns; four are ASCII text and three are booleans.


      >>> source_df.shape
      (32070402, 7)
      
      >>> source_df.dtypes
      Url Source object
      Url Destination object
      Anchor text object
      Follow / No-Follow object
      Link No-Follow bool
      Meta No-Follow bool
      Robot No-Follow bool
      dtype: object
      
      >>> source_df.to_parquet('export.parq', compression='gzip',
                               use_deprecated_int96_timestamps=True)
      
      >>> loaded_df = pd.read_parquet('export.parq')
      Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 288, in read_parquet
         return impl.read(path, columns=columns, **kwargs)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pandas/io/parquet.py", line 131, in read
         **kwargs).to_pandas()
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 1074, in read_table
         use_pandas_metadata=use_pandas_metadata)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/filesystem.py", line 184, in read_parquet
         use_pandas_metadata=use_pandas_metadata)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 943, in read
         use_pandas_metadata=use_pandas_metadata)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 500, in read
         table = reader.read(**options)
       File "/Users/tux/.pyenv/versions/3.7.0/lib/python3.7/site-packages/pyarrow/parquet.py", line 187, in read
         use_threads=use_threads)
       File "pyarrow/_parquet.pyx", line 721, in pyarrow._parquet.ParquetReader.read_all
       File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
      pyarrow.lib.ArrowIOError: Arrow error: Capacity error: BinaryArray cannot contain more than 2147483646 bytes, have 2147483685
      

      One would expect that if PyArrow can write a file successfully, it can read that same file back. Fortunately the fastparquet library reads this file without issue, so no data was lost, but the round-trip failure was a surprise.
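      For context (my reading, not part of the original report): the capacity in the error message matches the maximum a BinaryArray can address with the signed 32-bit value offsets Arrow's binary/string layout uses, which is why a single column holding just over 2 GiB of string data fails on read. A quick arithmetic check against the numbers in the traceback:

      ```python
      # Arrow's BinaryArray stores value offsets as signed 32-bit integers,
      # so one array's value buffer is capped just under 2 GiB.
      INT32_MAX = 2**31 - 1        # 2147483647

      # The cap reported by pyarrow is INT32_MAX - 1.
      capacity = INT32_MAX - 1     # 2147483646

      # Bytes the file's string column actually held, per the traceback.
      have = 2147483685

      print(have - capacity)       # prints 39 -- only 39 bytes over the limit
      ```

      So the column overshoots the 32-bit offset range by a few dozen bytes; later Arrow releases added 64-bit-offset large binary/string types and chunked reads that avoid this cap.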

        People

        • Assignee: Unassigned
        • Reporter: Diego Argueta